Chapter 5. Deployment

This chapter covers

  • Understanding the nature of failure in complex systems
  • Developing a simple mathematical model of failure
  • Using frequent low-cost failure to avoid infrequent high-cost failure
  • Using continuous delivery to measure and manage risk
  • Understanding the deployment patterns for microservices

The organizational decision to adopt the microservice architecture often represents an acceptance that change is necessary and that current work practices aren’t delivering. This is an opportunity not only to adopt a more capable software architecture but also to introduce a new set of work practices for that architecture.

You can use microservices to adopt a scientific approach to risk management. Microservices make it easier to measure and control risk, because they give you small units of control. The reliability of your production system then becomes quantifiable, allowing you to move beyond ineffective manual sign-offs as the primary risk-reduction strategy. Because traditional processes regard software as a bag of features that are either broken or fixed, and don’t incorporate the concept of failure thresholds and failure rates, they’re much weaker protections against failure.[1]

1

To be bluntly cynical, traditional practices are more about territorial defense and blame avoidance than building effective software.

5.1. Things fall apart

All things fail catastrophically. There’s no gradual decline. There’s decline, certainly; but when death comes, it comes quickly. Structures must maintain a minimum level of integrity to hold together. Cross that threshold, and the essence is gone.

This is more than poetic symbolism. Disorder always increases.[2] Systems can tolerate some disorder and can even convert chaos into order in the short term; but in the long run, we’re all dead, because disorder inevitably forces the system over the threshold of integrity into failure.

2

There are more ways to be disorganized than there are ways to be organized. Any given change is more likely to move you further into disorder.

What is failure? From the perspective of enterprise software, this question has many answers. Most visible are the technical failures of the system to meet uptime requirements, feature requirements, and acceptable performance and defect levels. Less visible, but more important, are failures to meet business goals.

Organizations obsess about technical failures, often causing business failures as a result. The argument of this chapter is that it’s better to accept many small failures in order to prevent large-scale catastrophic failures. It’s better for 5% of users to see a broken web page than for the business to go bankrupt because it failed to compete in the marketplace.

The belief that software systems can be free from defects and that this is possible through sheer professionalism is pervasive in the enterprise. There’s an implicit assumption that perfect software can be built at a reasonable cost. This belief ignores the basic dynamics of the law of diminishing marginal returns: the cost of fixing the next bug grows ever higher and is unbounded. In practice, all systems go into production with known defects. The danger of catastrophe comes from an institutional consensus that it’s best to pretend that this isn’t the case.

Can the microservice architecture speak to this problem? Yes, because it makes it easier to reduce the risk of catastrophic failure by allowing you to make small changes that have low impact. The introduction of microservices also provides you, as an architect, with the opportunity to reframe the discussion around acceptable failure rates and risk management. Unfortunately, there’s no forcing function, and microservice deployments can easily become mired in the traditional risk management approach of enterprise operations. It’s therefore essential to understand the possibilities for risk reduction that the architecture creates.

5.2. Learning from history

To understand how software systems fail and how you can improve deployment, you need to understand how other complex systems fail. A large-scale software system is not unlike a large-scale engineering system. There are many components interacting in many ways. With software, you have the additional complication of deployment—you keep changing the system. With something like a nuclear power plant, at least you only build it once. Let’s start by examining just such a complex system in production.

5.2.1. Three Mile Island

On March 28, 1979, the second unit of the nuclear power plant located on Three Mile Island near Harrisburg, Pennsylvania, suffered a partial meltdown, releasing radioactive material into the atmosphere.[3]

3

For full details, see John G. Kemeny et al., Report of the President’s Commission on the Accident at Three Mile Island (U.S. Government Printing Office, 1979), http://mng.bz/hwAm.

The accident was blamed on operator error. From a complex systems perspective, this conclusion is neither fair nor useful. With complex systems, failure is inevitable. The question isn’t, “Is nuclear energy safe?” but rather, “What levels of accidents and contamination can we live with?” This is also the question we should ask of software systems.

To understand what happened at Three Mile Island, you need to know how a reactor works at a high level, and at a low level where necessary. Your skills as a software architect will serve you well in understanding the explanation that follows. The reactor heats water, turning it into steam. The steam drives a turbine that spins to produce electricity. The reactor heats the water using a controlled fission reaction. The nuclear fuel, uranium, emits neutrons that collide with other uranium atoms, releasing even more neutrons. This chain reaction must be controlled through the absorption of excess neutrons; otherwise, bad things happen.

The uranium fuel is stored in a large, sealed, stainless steel containment vessel, about the height of a three-story building. The fuel is stored as vertical rods, about the height of a single story. Interspersed are control rods, made of graphite, that absorb neutrons; to control the reaction, you raise and lower the control rods. The reaction can be completely stopped by lowering all the control rods fully; this is known as scramming. This is an obvious safety feature: if there’s a problem, pretty much any problem, drop the rods![4] Nuclear reactors are designed with many such automatic safety devices (ASDs) that activate without human intervention on the basis of input signals from sensors. I’m sure you can already see the opportunity for unintended cascading behavior in the ASDs.

4

The technical term scram comes from the early days of research reactors. If anything went wrong, you dropped the rods, shouted “Scram!”, and ran. Very fast.

The heat from the core (all the stuff inside the containment vessel, including the rods) is extracted using water. This coolant water is radioactive, so you can’t use it directly to drive the turbine. You have to use a heat exchanger to transfer the heat to a set of completely separate water pipes; that water, which isn’t radioactive, drives the turbine. You have a primary coolant system with radioactive water and a secondary coolant system with “normal” water. Everything is under high pressure and at a high temperature, including the turbine, which is cooled by the secondary system. The secondary water must be very pure and contain almost no microscopic particles, to protect the turbine blades, which are precision engineered. Observe how complexity lives in the details: a simple fact—that water drives the turbine—hides the complexity that it must be “special” purified water. Time for a high-level diagram: see figure 5.1.

Figure 5.1. High-level components of a nuclear reactor

Now let’s go a little deeper. That special purified water for the secondary system doesn’t happen by magic: you need something called a condensate polisher to purify the water, using filters. Like many parts of the system, the condensate polisher’s valves, which allow water to enter and leave, are driven by compressed air. That means the plant, in addition to having water pipes for the primary and secondary cooling systems, also has compressed air pipes for a pneumatic system. Where does the special purified water come from? Feed pumps are used to pump water from a local water source—in this case, the Susquehanna River—into the cooling system. There are also emergency tanks, with emergency feed pumps, in case the main feed pumps fail. The valves for these are also driven by the pneumatic system.

We must also consider the core, which is filled with high-temperature radioactive water under high pressure.[5] High-pressure water is extremely dangerous and can damage the containment vessel and the associated pipework, leading to a dreaded loss-of-coolant accident (LOCA). You don’t want holes in the containment vessel. To alleviate water pressure in the core, a pressurizer is used. This is a large water tank connected to the core and filled about half and half with water and steam. The pressurizer also has a drain, which allows water to be removed from the core. The steam at the top of the pressurizer tank is compressible and acts as a shock absorber. You can control the core pressure by controlling the volume of the water in the lower half of the pressurizer. But you must never, ever allow the water level to reach 100% (referred to as going solid): if you do, you’ll have no steam and no shock absorber, and the result will be a LOCA because pipes will burst. This fact is drilled into operators from day one. Figure 5.2 shows the expanded diagram.

5

What fun!

Figure 5.2. A small subset of interactions between high and low levels in the reactor

The timeline of the accident

At 4:00 a.m., the steam turbine tripped: it stopped automatically because the feed pumps for the secondary cooling system that cools the turbine had stopped. With no water entering the turbine, the turbine was in danger of overheating—and it’s programmed to stop under these conditions. The feed pumps had stopped because the pneumatic air system that drives the valves for the pumps became contaminated with water from the condensate polisher. A leaky seal in the condensate polisher had allowed some of the water to escape into the pneumatic system. The end result was that a series of ASDs, operating as designed, triggered a series of ever-larger failures. More was to come.

With the turbine down and no water flowing in the secondary coolant system, no heat could be extracted from the primary coolant system. The core couldn’t be cooled. Such a situation is extremely dangerous and, if not corrected, will end with a meltdown.

There was an ASD for this scenario: emergency feed pumps take water from an emergency tank. The emergency pumps kick in automatically. Unfortunately, in this case, the pipes to the emergency pumps were blocked, because two valves had been accidentally left closed during maintenance. Thus the emergency pumps supplied no water. The complexity of the system as a whole, and its interdependencies, are apparent here; not just the machinery, but also its management and maintenance, are part of the dependency relationship graph.

The system had now entered a cascading failure mode. With no feedwater, the secondary side of the heat exchanger boiled dry. The reactor scrammed automatically, dropping all the control rods to stop the fission reaction. This didn’t reduce the heat to safe levels, however, because the decay products from the reaction still needed to cool down. Normally this takes several days and requires a functioning cooling system. With no cooling, very high temperatures and pressures built up in the containment vessel, which was in danger of breaching.

Naturally, there are ASDs for this scenario. A relief valve, known as the pilot-operated relief valve (PORV), opens under high pressure and allows the core water to expand into the pressurizer vessel. But the PORV is unreliable: valves for high-pressure radioactive water fail, unsurprisingly, about 1 in 50 times. In this case, the PORV opened in response to the high-pressure conditions but then failed to close fully after the pressure was relieved. It’s important for operators to know the status of the PORV, and this one had recently been fitted with a status sensor and indicator. This sensor also failed, though, leading the operators to believe that the PORV was closed. The reactor was now under a LOCA, and ultimately more than one third of the primary cooling water drained away. The actual status of the PORV wasn’t noticed until a new shift of operators started.

As water drained away, pressure in the core fell, but by too much. Steam pockets formed. These not only block water flow but also are far less efficient at removing heat. So, the core continued to overheat. At this point, Three Mile Island was only 13 seconds into the accident, and the operators were completely unaware of the LOCA; they saw only a transient pressure spike in the core. Two minutes into the event, pressure dropped sharply as core coolant turned to steam. At this point, the fuel rods in the core were in danger of becoming exposed, because there was barely sufficient water to cover them. Another ASD kicked in—injection of cold, high-pressure water. This is a last resort to save the core by keeping it covered. The problem is that too much cold water can crack the containment vessel. Also, and far worse, too much water makes the pressurizer go solid. Without a pressure buffer, pipes will crack. So the operators, as they had been trained to do, slowed the cold-water injection rate.[6]

6

Notice that the operators were using a mental model that had diverged from reality. Much the same happens with the operation of software systems under high load.

The core became partially exposed as a result, and a partial meltdown occurred. Although the PORV was eventually closed and the water was brought under control, the core was badly damaged. Chemical reactions inside the core led to the release of hydrogen gas, which caused a series of explosions; ultimately, radioactive material was released into the atmosphere.[7]

7

An excellent analysis of this accident, and many others, can be found in the book Normal Accidents (Princeton University Press, 1999) by Charles Perrow. This book also develops a failure model for complex systems that’s relevant to software systems.

Learning from the accident

Three Mile Island is one of the most-studied complex systems accidents. Some have blamed the operators, who “should” have understood what was happening, “should” have reopened the valves after maintenance, and “should” have left the high-pressure cold-water injection running.[8] Have you ever left your home and not been able to remember whether you locked the front door? Imagine having 500 front doors—on any given day, in any given reactor, some small percentage of valves will be in the wrong state.

8

Or should they? Doing so might have cracked the containment vessel, causing a far worse accident. Expert opinion is conflicted on this point.

Others blame the accident on sloppiness in the management culture, saying there “should” have been lock sheets for the valves. But since then, adding more paperwork to track work practices has only reduced valve errors in other reactors, not eliminated them. Still others blame the design of the reactor: too much complexity, coupling, and interdependence. A simpler design has fewer failure modes. But it’s in the nature of systems engineering that there’s always hidden complexity, and the final version of initially simple designs becomes complex as a result.

These judgments aren’t useful, because they’re all obvious and true to some degree. The real learning is that complex systems are fragile and will fail. No amount of safety devices and procedures will solve this problem, because the safety devices and procedures are part of the problem. Three Mile Island made this clear: the interactions of all the components of the system (including the humans) led to failure.

There’s a clear analogy to software systems. We build architectures that have a similar degree of complexity and the same kinds of interactions and tight couplings. We try to add redundancy and fail-safes but find that these measures fail anyway, because they haven’t been sufficiently tested. We try to control risk with detailed release procedures and strict quality assurance. Supposedly, this gives us predictable and safe deployments; but in practice, we still end up having to do releases on the weekend, because strict procedures aren’t that effective. In one way, we’re worse than nuclear reactors—with every release, we change fundamental core components!

You can’t remove risk by trying to contain complexity. Eventually, you’ll have a LOCA.

5.2.2. A model for failure in software systems

Let’s try to understand the nature of failure in software systems using a simple model. You need to quantify your exposure to risk so that you can understand how different levels of complexity and change affect a system.

A software system can be thought of as a set of components, with dependency relationships between those components. The simplest case is a single component. Under what conditions does the component, and thus the entire system, fail?

To answer that question, we should clarify the term failure. In this model, failure isn’t an absolute binary condition, but a quantity you can measure. Success might be 100% uptime over a given period, and failure might be any uptime less than 100%. But you might be happy with a failure rate of 1%, setting 99% uptime as the threshold of success. You could count the number of requests that have correct responses: out of every 1,000 requests, perhaps 10 fail, yielding a failure rate of 1%. Again, you might be happy with this. Loosely, we can define the failure rate as the proportion of some quantity (continuous or discrete) that fails to meet a specific threshold value. Remember that you’re building as simple a model as you can, so exactly what is failing is excluded from the model. You only care about the rate and meeting the threshold. Failure means failure to meet the threshold, not failure to operate.

For a one-component system, if the component has a failure rate of 1%, then the system as a whole has a failure rate of 1% (see figure 5.3). Is the system failing?

Figure 5.3. A single-component system, where P0 is the failure rate of component C0

If the acceptable failure threshold is 0.5%, then the system is failing. If the acceptable failure threshold is 2%, then the system is not failing: it’s succeeding, and your work is finished.

This model reflects an important change of perspective: accepting that software systems are in a constant state of low-level failure. There’s always a failure rate. Valves are left closed. The system as a whole fails only when a threshold of pain is crossed. This new perspective is different from the embedded organizational assumption that software can be perfect and operate without defects. The obsession with tallying defective features seems quaint from this viewpoint. Once you gain this perspective, you can begin to understand how the operational costs of the microservice architecture are outweighed by the benefit of superior risk management.

A two-component system

Now, consider a two-component system (see figure 5.4). One component depends on the other, so both must function correctly for the system to succeed. Let’s set the failure threshold at 1%. Perhaps this is the proportion of failed purchases, or maybe you’re counting many different kinds of errors; this isn’t relevant to the model. Let’s also make the simplifying assumption that both components fail independently of each other.[9] The failure of one doesn’t make the other more likely to fail. Each component has its own failure rate. In this system, a given function can succeed only if both components succeed.

9

This assumption is important to internalize. Components are like dice: they don’t affect each other, and they have no memory. If one component fails, it doesn’t make another component more likely to fail. It may cause the other component to fail, but that’s different, because the failure has an external cause. We’re concerned with internal failure, independent of other components.

Figure 5.4. A two-component system, where Pi is the failure rate of component Ci

Because the components fail independently, the rules of probability say that you can multiply the probabilities. There are four cases: both fail, both succeed, the first fails and the second succeeds, or the first succeeds and the second fails. You want to know the failure rate of the system; this is the same as asking for the probability that a given transaction will fail. Of the four cases, three are failing, and one is success. This makes the calculation easier: multiply the success probabilities together to get the probability for the case where the entire system succeeds. The failure probability is found by subtracting the success probability from 1.[10] Keeping the numbers simple, assume that each component has the same failure probability of 1%. This gives an overall failure probability of 1 – (99% x 99%) = 1 – 98.01% = 1.99%.

10

The system can be in only one of two states: success or failure. The probabilities of both must sum to 1. This means you can find one state if you can find the other, so you get to choose the one with the easier formula.

Despite the fact that both components are 99% reliable, the system as a whole is only 98% reliable and fails to meet the success threshold of 99%. You can begin to see that meeting an overall level of system reliability—where that system is composed of components, all of which are essential to operation—is harder than it looks. Each component must be a lot more reliable than the system as a whole.

Multiple components

You can extend this model to any number of components, as long as the components depend on each other in a serial chain. This is a simplification from the real software architectures we know and love, but let’s work with this simple model to build some understanding of failure probabilities. Using the assumption of failure independence, where you can multiply the probabilities together, yields the following formula for the overall probability of failure for a system with an arbitrary number of components in series.

PF = 1 – ((1 – P1) x (1 – P2) x ... x (1 – Pn))

Here, PF is the probability of system failure, n is the number of components, and Pi is the probability that component i fails.

If you chart this formula against the number of components in the system, as shown in figure 5.5, you can see that the probability of failure grows quickly with the number of components. Even though each component is reliable at 99% (we’ve given each component the same reliability to keep things simple), the system is unreliable. For example, reading from the chart, a 10-component system has just under a 10% failure rate. That’s a long way from the desired 1%.

Figure 5.5. Probability of system failure against the number of components, where all components are 99% reliable
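If you want to check these numbers yourself, the following sketch (in Python, purely illustrative) applies the serial-failure formula to chains of 99%-reliable components.

  # Failure rate of a serial chain: the system succeeds only if every
  # component succeeds, so multiply the success rates and subtract from 1.
  def serial_failure(reliabilities):
      success = 1.0
      for r in reliabilities:
          success *= r
      return 1.0 - success

  for n in (1, 2, 4, 10, 20):
      print(f"{n:2d} components: {serial_failure([0.99] * n):.2%} failure rate")
  # 10 components come out at roughly 9.6%: the "just under 10%" of figure 5.5.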

The model demonstrates that intuitions about reliability can often be incorrect. A convoy of ships is as slow as its slowest ship, but a software architecture isn’t as unreliable as its most unreliable component—it’s much more unreliable, because the other components can fail, too.

The system in the Three Mile Island reactor definitely wasn’t linear. It consisted of a complicated set of components with many interdependencies. Real software is much like Three Mile Island, and software components tend to be even more tightly coupled, with no tolerance for errors. Let’s extend the model to see how this affects reliability. Consider a system with four components, one of which is a subcomponent that isn’t on the main line (see figure 5.6). Three have a serial dependency, but the middle component also depends on the fourth.

Figure 5.6. A nonlinear four-component system

Again, each of the four components has the same 99% reliability. How reliable is the system as a whole? You can solve the serial case with the formula introduced earlier. The reliability of the middle component must take into account its dependency on the fourth component. This is a serial system as well, contained inside the main system. It’s a two-component system, and you’ve seen that this has a reliability of 100% - 1.99% = 98.01%. Thus, the failure probability of the system as a whole is 1 – (99% x 98.01% x 99%) = 1 – 96.06% = 3.94%.
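As a quick check of that arithmetic (assuming, as before, 99% reliability for every component), the nested subsystem collapses into the same serial product:

  # The middle component and its subcomponent form a two-component chain
  # (98.01% reliable), which then sits in series with the other two components.
  r = 0.99
  middle = r * r                       # middle component plus its subcomponent
  system_success = r * middle * r
  print(f"system failure rate: {1 - system_success:.2%}")   # ~3.94%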

What about an arbitrary system with many dependencies, or systems where multiple components depend on the same subcomponent? You can make another simplifying assumption to handle this case: assume that all components are necessary, and there are no redundancies. Every component must work. This seems unfair, but think of how the Three Mile Island accident unfolded. Supposedly redundant systems such as the emergency feed pumps turned out to be important as standalone components. Yes, the reactor could work without them, but it was literally an accident waiting to happen.

If all components are necessary, then the dependency graph can be ignored. Every component is effectively on the main line. It’s easy to overlook subcomponents or assume that they don’t affect reliability as much, but that’s a mistake. Interconnected systems are much more vulnerable to failure than you may think, because there are many subcomponent relationships. The humans who run and build the system are one such subcomponent relationship—you can only blame “human error” for failures if you consider humans to be part of the system. Ignoring the dependency graph only gives you a first-order approximation of the failure rate, using the earlier formula; but given how quickly independent probabilities compound, that estimate is more than sufficient.

5.2.3. Redundancy doesn’t do what you think it does

You can make your systems more reliable by adding redundancy. Instead of one instance of a component that might fail, have many. Keeping to the simple model where failures are independent, this makes the system much more reliable. To calculate the failure probability of a set of redundant components, you multiply the individual failure probabilities, because all must fail in order for the entire assemblage to fail.[11] Now, you find that probability theory is your friend. In the single-component system, adding a second redundant component gives you a failure rate of 1% x 1% = 0.01%.

11

The failure probability formula, in this case, is PF = P1 x P2 x ... x Pn.
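A quick illustration of this arithmetic (still under the independence assumption, which the next paragraph shows rarely holds):

  # A redundant set fails only if every instance fails, so the individual
  # failure probabilities multiply.
  p_instance_failure = 0.01
  for instances in (1, 2, 3):
      print(f"{instances} instance(s): "
            f"{p_instance_failure ** instances:.4%} failure rate")
  # 2 instances: 0.0100%, matching the 1% x 1% = 0.01% figure above.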

It seems that all you need to do is add lots of redundancy, and your problems go away. Unfortunately, this is where the simple model breaks down. There are few failure modes in a software system where failure of one instance of a component is independent of other components of the same kind. Yes, individual host machines can fail,[12] but most failures affect all software components equally. The data center is down. The network is down. The same bug applies to all instances. High load causes instances to fall like dominoes, or to flap.[13] A deployment of a new version fails on production traffic.

12

Physical power supplies fail all the time, and so do hard drives; network engineers will keep stepping on cables from now until the end of the universe; and we’ll never solve the Halting Problem (it’s mathematically impossible to prove that any given program will halt instead of executing forever—you can thank Mr. Alan Turing for that), so there will always be input that triggers infinite loops.

13

Flapping occurs when services keep getting killed and restarted by the monitoring system. Under high load, newly started services are still cold (they have empty caches), and their tardiness in responding to requests is interpreted as failure, so they’re killed. Then more services are started. Eventually, there are no services that aren’t either starting or stopping, and work comes to a halt.

Simple models are also useful when they break, because they can reveal hidden assumptions. Load balancing over multiple instances doesn’t give you strong redundancy; it gives you capacity. It barely moves the reliability needle, because multiple instances of the same component are not independent.[14]

14

The statement that multiple instances of the same software component don’t fail independently is proposed as an empirical fact from the observed behavior of real systems; it isn’t proposed as a mathematical fact.

Automatic safety devices are unreliable

Another way to reduce the risk of component failure is to use ASDs. But as you saw in the story of Three Mile Island, these bring their own risks. In the model, they’re additional components that can themselves fail.

Many years ago, I worked on a content-driven website. The site added 30 or 40 news stories a day. It wasn’t a breaking news site, so a small delay in publishing a story was acceptable. This gave me the brilliant idea to build a 60-second cache. Most pages could be generated once and cached for 60 seconds. Once the cache expired, any news updates appeared on the regenerated pages, and the next 60-second caching period began.

This seemed like a cheap way to build what was effectively an ASD for high load. The site would be able to handle things like election day results without needing to increase server capacity much.

The 60-second cache was implemented as an in-memory cache on each web server. It was load tested, and everything appeared to be fine. But in production, servers kept crashing. Of course, there was a memory leak; and, of course, it didn’t manifest unless we left the servers running for at least a day, storing more than 1,440 copies of each page, for each article, in memory. The first week we went live was a complete nightmare—we babysat dying machines on a 24/7 rotation.

5.2.4. Change is scary

Let’s not throw out the model just yet. Software systems aren’t static, and they suffer from catastrophic events known as deployments. During a deployment, many components are changed simultaneously. In many systems, this can’t be done without downtime. Let’s model this as a simultaneous change of a random subset of components. What does this do to the reliability of the system?

By definition, the reliability of a component is given by its measured failure rate in production. If a given component drops only 1 work item in 100, it has 99% reliability. Once a deployment is completed and has been live for a while, you can measure production to get the reliability rate. But this isn’t much help in advance. You want to know the probability of failure of the new system before the changes are made.

Our model isn’t strong enough to provide a formula for this situation. But you can use another technique: Monte Carlo simulation. You run lots of simulations of the deployment and add up the numbers to see what happens. Let’s use a concrete example. Assume you have a four-component system, and the new deployment consists of updates to all four components. In a static state, before deployment, the reliability of the system is given by the standard formula: 0.99⁴ = 0.9606 = 96.1%.

To calculate the reliability after deployment, you need to estimate the actual reliabilities of each component. Because you don’t know what they are, you have to guess them. Then you run the formula using the guesses.

If you do this many times, you’ll be able to plot the distribution of system reliability. You can say things like, “In 95% of simulations, the system has at least 99% reliability. Deploy!” Or, “In only 1% of simulations, the system has at least 99% reliability. Unplanned downtime ahead!” Bear in mind that these numbers are just for discussion; you’ll need to decide on numbers that reflect your organization’s risk tolerance.

How do you guess the reliability of a component? You need to do this in a way that makes the simulation useful. Reliability isn’t normally distributed, like a person’s height.[15] Reliability is skewed because components are mostly reliable—most components come in around 99% and can’t go much higher. There’s a lot of space below 99% in which to fail. Your team is doing unit testing, staging, code reviews, and so on. The QA department has to sign off on releases, and the head of QA is pretty strict. There’s a high probability that your components are, in fact, reliable; but you can’t test for everything, and production is a crueler environment than a developer’s laptop or a staging system.

15

The normal distribution assumes that any given instance will be close to the average and has as much chance of being above average as below.

You can use a skewed probability distribution[16] to model “mostly reliable.” The chart in figure 5.7 shows how the failure probabilities are distributed. To make a guess, pick a random number between 0 and 1, and plot its corresponding probability. You can see that most guesses will give a low failure probability.

16

The Pareto distribution is used in this example, because it’s a good model for estimating failure events.

Figure 5.7. A skewed estimator of failure probability

For each of the four components, you get a reliability estimate. Multiply these together in the usual manner. Now, do this many times; over many simulation runs, you can chart the reliability of the system. Figure 5.8 shows the output from a sample exercise.[17] Although the system is sometimes fairly reliable, most simulation runs show poor reliability compared to the static system. In only 0.15% of simulations does the system have reliability of 95% or more.

17

In the sample exercise, 1,000 runs were executed and then categorized into 5% intervals.

Figure 5.8. Estimated reliability of the system when four components change simultaneously
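Here’s a sketch of such a simulation in Python. The Pareto shape parameter and the 95% cutoff are assumptions on my part (the exact distribution behind figure 5.8 isn’t specified), so your numbers will differ from those above, but the overall shape of the result is the same: most runs land well below the reliability of the static system.

  import random

  def simulate_deployment(n_components=4, runs=1000, shape=10.0):
      """Monte Carlo sketch: guess a failure probability for each changed
      component from a skewed (Pareto) distribution, combine the guesses
      with the serial formula, and count how often the system clears 95%."""
      cleared = 0
      for _ in range(runs):
          # random.paretovariate(shape) returns values >= 1; subtracting 1
          # gives a failure probability that is usually small but has a
          # long tail of much worse guesses.
          guesses = [min(1.0, random.paretovariate(shape) - 1.0)
                     for _ in range(n_components)]
          success = 1.0
          for p in guesses:
              success *= (1.0 - p)
          if success >= 0.95:
              cleared += 1
      return cleared / runs

  print(f"runs with at least 95% reliability: {simulate_deployment():.2%}")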

The model shows that simultaneous deployment of multiple components is inherently risky: it almost always fails the first time. That’s why, in practice, you have scheduled downtime or end up frantically working on the weekend to complete a deployment. You’re really making multiple repeated deployment attempts, trying to resolve production issues that are almost impossible to predict.

The numbers don’t work in your favor. You’re playing a dangerous game. Your releases may be low frequency, but they have high risk.[18] And it seems as though microservices must introduce even more risk, because you have many more components. Yet, as you’ll discover in this chapter, microservices also provide the flexibility for a solution. If you’re prepared to accept high-frequency releases of single components, then you’ll get much lower risk exposure.

18

The story of the deployment failure suffered by Knight Capital, in chapter 1, is a perfect example of this danger.

I’ve labored the mathematics to make a point: no software development methodology can defy the laws of probability at reasonable cost. Engineering, not politics, is the key to risk management.

5.3. The centre cannot hold

The collective delusion of enterprise software development is that perfect software can be delivered complete and on time, and deployed to production without errors, by force of management. Any defects are a failure of professionalism on the part of the team. Everybody buys into this delusion. Why?

This book does not take the trite and lazy position that it’s all management’s fault. I don’t pull punches when it comes to calling out bad behavior, but you must be careful to see organizational behavior for what it is: rational.

We can analyze corporate politics using Game Theory.[19] Why does nobody point out the absurdities of enterprise software development, even when there are mountains of evidence? How many more books must be written on the subject? Fortunately, we live in an age where the scale of the software systems we must build is slowly forcing enterprise software development to face reality.

19

The part of mathematics that deals with multiplayer games and the limitations of strategies to maximize results.

Traditional software development processes are an unwanted Nash equilibrium in the game of corporate politics. They’re a kind of prisoner’s dilemma.[20] If all stakeholders acknowledged that failure rates must exist and used that as a starting point, then continuous delivery would be seen as a natural solution. But nobody is willing to do so; it would be a career-limiting move. Failure isn’t an option! So we’re stuck with a collective delusion because we can’t communicate honestly. This book aims to give you some solid principles to start that honest communication.

20

A Nash equilibrium is a state in a game where no player can improve their position by changing strategy unilaterally. The prisoner’s dilemma is a compact example: two guilty criminals who robbed a bank together are captured by the police and placed in separate cells where they can’t communicate. If they both stay silent, then they both get a one-year sentence for possession of stolen cash, but the police can’t prove armed robbery. The police offer a deal separately to each criminal: confess and betray your accomplice, and instead of three years for armed robbery, you’ll only get two years, because you cooperated. The only rational strategy is for each criminal to betray the other and take the two years, because their partner might betray them. Because they can’t communicate, they can’t agree to remain silent.

Warning

It isn’t advisable to push for change unless there’s a forcing function. Wait until failure is inevitable under the old system, and then be the white knight. Pushing for change when you have no leverage is indeed a career-limiting move.

5.3.1. The cost of perfect software

The software that controlled the space shuttle was some of the most perfect software ever written. It’s a good example of how expensive such software truly is and calls out the absurdity of the expectations for enterprise software. It’s also a good example of how much effort is required to build redundant software components.

The initial cost estimate for the shuttle software system was $20 million. The final bill was $200 million. This is the first clue that defect-free software is an order of magnitude more expensive than even software engineers estimate. The full requirements specification has 40,000 pages—for a mere 420,000 lines of code. By comparison, Google’s Chrome web browser has more than 5 million lines of code. How perfect is the shuttle software? On average, there was one bug per release. It wasn’t completely perfect!

The shuttle’s software development process was incredibly strict. It was a traditional process with detailed specifications, strict testing, and code reviews, and bureaucratic signatures were needed for release. Many stakeholders in the enterprise software development process believe that this level of delivery is what they’re going to get.

It’s the business of business to make return-on-investment decisions. You spend money to make money, but you must have a business case. This system breaks down if you don’t understand your cost model. It’s the job of the software architect to make these costs clear and to provide alternatives, where the cost of software development is matched to the expected returns of the project.

5.4. Anarchy works

The most important question in software development is, “What is the acceptable error rate?” This is the first question to ask at the start of a project. It drives all other questions and decisions. It also makes clear to all stakeholders that the process of software development is about controlling, not conquering, failure.

The primary consequence is that large-scale releases can never meet the acceptable error rate. Reliability is so compromised by the uncertainty of a large release that large releases must be rejected as an engineering approach. This is just mathematics, and no amount of QA can overcome it.

Small releases are less risky. The smaller, the better. Small releases have small uncertainties, and you can stay beneath the failure threshold. Small releases also mean frequent releases. Enterprise software must constantly change to meet market forces. These small releases must go all the way to production to fully reduce risk; collecting them into large releases takes you back to square one. That’s just how the probabilities work.

A system that’s under constant failure isn’t fragile. Every component expects others to fail and is built to be tolerant of failure. The constant failure of components exercises redundant systems and backups so you know they work. You have an accurate measure of the failure rate of the system: it’s a known quantity that can be controlled. The rate of deployment can be adjusted as risks grow and shrink.

How does the simple risk model work under these conditions? You may be changing only one component at a time, but aren’t you still subject to large amounts of risk? You know that your software development process won’t deliver updated components that are as stable as those that have been baked into production for a while.

Let’s say updated components are 80% reliable on first deployment. The systems we’ve looked at definitely won’t meet a reliability threshold of 99%. Redeploying a single component still isn’t a small enough deployment. This is an engineering and process problem that we’ll address in the remainder of this chapter: how to make changes to a production software system while maintaining the desired risk tolerance.

5.5. Microservices and redundancy

An individual component of a software system should never be run as a single instance. A single instance is vulnerable to failure: the component could crash, the machine it’s running on could fail, or the network connection to that machine could be accidentally misconfigured. No component should be a single point of failure.

To avoid a single point of failure, you can run multiple instances of the component. Then, you can handle load, and you’re more protected against some kinds of failure. You aren’t protected against software defects in the component, which affect all instances, but such defects can usually be mitigated by automatic restarts.[21] Once a component has been running in production for a while, you’ll have enough data to get a good measure of its reliability.

21

Restarts don’t protect you against nastier kinds of defects, such as poison messages.

How do you deploy a new version of a component? In the traditional model, you try, as quickly as possible, to replace all the old instances with a full set of new ones. The blue-green deployment strategy, as it’s known, is an example of this. You have a running version of the system; call this the blue version. You spin up a new version of the system; call this the green version. Then, you choose a specific moment to redirect all traffic from blue to green. If something goes wrong, you can quickly switch back to blue and assess the damage. At least you’re still up.

One way to make this approach less risky is to initially redirect only a small fraction of traffic to green. If you’re satisfied that everything still works, redirect greater volumes of traffic until green has completely taken over.

The microservice architecture makes it easy to adopt this strategy and reduce risk even further. Instead of spinning up a full quota of new instances of the green version of the service, you can spin up one. This new instance gets a small portion of all production traffic while the existing blues look after the main bulk of traffic. You can observe the behavior of the single green instance, and if it’s badly behaved, you can decommission it: a small amount of traffic is affected, and there’s a small increase in failure, but you’re still in control. You can fully control the level of exposure by controlling the amount of traffic you send to that single new instance.

Microservice deployments consist of nothing more than introducing a single new instance. If the deployment fails, rollback means decommissioning a single instance. Microservices give you well-defined primitive operations on your production system: add/remove a service instance.[22] Nothing more is required. These primitive operations can be used to construct any deployment strategy you desire. For example, blue-green deployments break down into a list of add and remove operations on specific instances.

22

More formally, we might call these primitive operations activate and deactivate, respectively. How the operations work depends entirely on the underlying deployment platform.
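To make that concrete, here’s a minimal sketch in Python; the Deployer class and its method names are mine, for illustration only, not the API of any real orchestration tool.

  class Deployer:
      """Illustrative only: tracks which service instances are active."""
      def __init__(self):
          self.instances = []                    # (service, version) pairs

      def activate(self, service, version):
          """Primitive: add one instance of a specific immutable artifact."""
          self.instances.append((service, version))

      def deactivate(self, service, version):
          """Primitive: remove one instance."""
          self.instances.remove((service, version))

  def progressive_rollout(d, service, old, new, old_count):
      """A canary-style rollout expressed purely as a composition of the
      two primitives: add one new instance, observe it, then replace the
      old instances one at a time."""
      d.activate(service, new)
      # ... observe the canary here; if it misbehaves, deactivate it and stop ...
      for _ in range(old_count):
          d.activate(service, new)
          d.deactivate(service, old)

  d = Deployer()
  for _ in range(3):
      d.activate("checkout", "1.0.0")
  progressive_rollout(d, "checkout", old="1.0.0", new="2.0.0", old_count=3)
  print(d.instances)                             # only 2.0.0 instances remain

Keeping the primitives this small is what makes rollback cheap: undoing any step is a single deactivate.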

Defining a primitive operation is a powerful mechanism for achieving control. If everything is defined in terms of primitives, and you can control the composition of the primitives, then you can control the system. The microservice instance is the primitive and the unit with which you build your systems. Let’s examine the journey of that unit from development to production.

5.6. Continuous delivery

The ability to safely deploy a component to production at any time is powerful because it lets you control risk. Continuous delivery (CD) in a microservice context means the ability to create a specific version of a microservice and to run one or more instances of that version in production, on demand. The essential elements of a CD pipeline are as follows:

  • A version-controlled local development environment for each service, supported by unit testing, and the ability to test the service against an appropriate local subset of the other services, using mocking if necessary.
  • A staging environment to both validate the microservice and build, reproducibly, an artifact for deployment. Validation is automated, but scope is allowed for manual verification if necessary.
  • A management system, used by the development team to execute combinations of primitives against staging and production, implementing the desired deployment patterns in an automated manner.
  • A production environment that’s constructed from deployment artifacts to the fullest extent possible, with an audit history of the primitive operations applied. The environment is self-correcting and able to take remedial action, such as restarting crashed services. The environment also provides intelligent load balancing, allowing traffic volumes to vary between services.
  • A monitoring and diagnostics system that verifies the health of the production system after the application of each primitive operation and allows the development team to introspect and trace message behavior. Alerts are generated from this part of the system.

The pipeline assumes that the generation of defective artifacts is a common occurrence. The pipeline attempts to filter them out at each stage. This is done on a per-artifact basis, rather than trying to verify an update to the entire system. As a result, the verification is both more accurate and more credible, because confounding factors have been removed.

Even when a defective artifact makes it to production, this is considered a normal event. The behavior of the artifact is continuously verified in production after deployment, and the artifact is removed if its behavior isn’t acceptable. Risk is controlled by progressively increasing the proportion of activity that the new artifact handles.

CD is based on the reality of software construction and management. It delivers the following:

  • Lower risk of failure by favoring low-impact, high-frequency, single-instance deployments over high-impact, low-frequency, multiple-instance deployments.
  • Faster development by enabling high-frequency updates to the business logic of the system, giving a faster feedback loop and faster refinement against business goals.
  • Lower cost of development, because the fast feedback loop reduces the amount of time wasted on features that have no business value.
  • Higher quality, because less code is written overall, and the code that’s written is immediately verified.

The tooling to support CD and the microservice architecture is still in the early stages of development. Although an end-to-end CD pipeline system is necessary to fully gain the benefits of the microservice architecture, it’s possible to live with pipeline elements that are less than perfect.

At the time of writing, all teams working with this approach are using multiple tools to implement different aspects of the pipeline, because comprehensive solutions don’t exist. The microservice architecture requires more than the current platform-as-a-service (PaaS) offerings that vendors provide. Even when comprehensive solutions emerge, they will present trade-offs in implementation focus.[23]

23

The Netflix suite (http://netflix.github.io) is a good example of a comprehensive, but opinionated, toolchain.

You’ll probably continue to need to put together a context-specific toolset for each microservice system you build; as we work through the rest of this chapter, focus on the desirable properties of these tools. You’ll almost certainly also need to invest in developing some of your own tooling—at the very least, integration scripts for the third-party tools you select.

5.6.1. Pipeline

The purpose of the CD pipeline is to provide feedback to the development team as quickly as possible. In the case of failure, that feedback should indicate the nature of the failure; it must be easy to see the failing tests, the failing performance results, or the failed integrations. You also should be able to see a history of the verifications and failures of each microservice. This isn’t the time to roll your own tooling—many capable continuous integration (CI) tools are available.[24] The key requirement is that your chosen tool be able to handle many projects easily, because each microservice is built separately.

24

Two quick mentions: if you want to run something yourself, try Hudson (http://hudson-ci.org); if you want to outsource, try Travis CI (http://travis-ci.org).

The CI tool is just one stage of the pipeline, usually operating before a deployment to the staging systems. You need to be able to trace the generation of microservices throughout the pipeline. The CI server generates an artifact that will be deployed into production. Before that happens, the source code for the artifact needs to be marked and tagged so that artifact generation can be hermetic—you must be able to reproduce any build from the history of your microservice. After artifact generation, you must be able to trace the deployment of the artifact over your systems from staging to production. This tracing must happen not only at the system level but also within the system, recording how many instances of each artifact ran, and when. Until third-party tooling solves this problem, you’ll have to build this part of the pipeline diagnostics yourself; it’s an essential and worthwhile investment for investigating failures.

The unit of deployment is a microservice, so the unit of movement through the pipeline is a microservice. The pipeline should focus on the generation and validation of artifacts that represent a microservice. A given version of a microservice is instantiated as a fixed artifact that never changes. Artifacts are immutable: the same version of a microservice always generates the same artifact, at a binary-encoding level. It’s natural to store these artifacts for quick access.[25] Nonetheless, you need to retain the ability to hermetically rebuild any version of a microservice, because the build process is an important element of defect investigation.

25

Amazon S3 isn’t a bad place to store them. There are also more-focused solutions, such as JFrog Artifactory (www.jfrog.com/artifactory).

The development environment also needs to make the focus on individual microservices fluid and natural. In particular, this affects the structure of your source code repositories. (We’ll look at this more deeply in chapter 7.) Local validation is also important, as the first measure of risk. Once a developer is satisfied that a viable version of the microservice is ready, the developer initiates the pipeline to production.

The staging environment reproduces the development-environment validation in a controlled environment so that it isn’t subject to variances in local developer machines. Staging also performs scaling and performance tests and can use multiple machines to simulate production, to a limited extent. Staging’s core responsibility is to generate an artifact that has an estimated failure risk that’s within a defined tolerance.

Production is the live, revenue-generating part of the pipeline. Production is updated by accepting an artifact and a deployment plan and applying that deployment plan under measurement of risk. To manage risk, the deployment plan is a progressive execution of deployment primitives—activating and deactivating microservice instances. Tooling for production microservices is the most mature at present because it’s the most critical part of the pipeline. Many orchestration and monitoring tools are available to help.[26]

26

Common choices are Kubernetes (https://kubernetes.io), Mesos (http://mesos.apache.org), Docker (www.docker.com), and so forth. Although these tools fall into a broad category, they operate at different levels of the stack and aren’t mutually exclusive. The case study in chapter 9 uses Docker and Kubernetes.

5.6.2. Process

It’s important to distinguish continuous delivery from continuous deployment. Continuous deployment is a form of CD where commits, even if automatically verified, are pushed directly and immediately to production. CD operates at a coarser grain: sets of commits are packaged into immutable artifacts. In both cases, deployments can effectively take place in real time and occur multiple times per day.

CD is more suited to the wider context of enterprise software development because it lets teams accommodate compliance and process requirements that are difficult to change within the lifetime of the project. CD is also more suited to the microservice architecture, because it allows the focus to be on the microservice rather than code.

If we view “continuous delivery” as meaning continuous delivery of microservice instances, this understanding drives other virtues. Microservices should be kept small so that verification—especially human verification, such as code reviews—is possible within the desired time frame of multiple deployments per day.

5.6.3. Protection

The pipeline protects you from exceeding failure thresholds by providing measures of risk at each stage, from development through staging to production. It isn’t necessary to extract a failure-probability prediction from each measure.[27] You’ll know the feel of the measures for your system, and thus you can use a scoring approach just as effectively.

27

You could use statistical techniques such as Bayesian estimation to do this if desired.

In development, the key risk-measuring tools are code reviews and unit tests. Using modern version control for branch management[28] means you can adopt a development workflow where new code is written on a branch and then merged into the mainline. The merge is performed only if the code passes a review. The review can be performed by a peer, rather than a senior developer: peer developers on a project have more information and are better able to assess the correctness of a merge. This workflow means code review is a normal part of the development process and very low friction. Microservices keep the friction even lower because the units of review are smaller and have less code.

28

Using a distributed version control system such as Git is essential. You need to be able to use pull requests to implement code reviews.

Unit tests are critical to risk measurement. You should take the perspective that unit tests must pass before branches can be merged or code committed on the mainline. This keeps the mainline potentially deployable at all times, because a build on staging has a good chance of passing. Unit tests in the microservice world are concerned with demonstrating the correctness of the code; other benefits of unit testing, such as making refactoring safer, are less relevant.

Unit tests aren’t sufficient for accurate risk measurement and are very much subject to diminishing marginal returns: moving from 50% test coverage to 100% reduces deployment risk much less than moving from 0% to 50%. Don’t get suckered into the fashion for 100% test coverage—it’s a fine badge of honor (literally!) for open source utility components but is superstitious theater for business logic.

On the staging system, you can measure the behavior of a microservice in terms of its adherence to the message flows of the system. Ensuring that the correct messages are sent by the service, and that the correct responses are given, is also a binary pass/fail test, which you can score with a 0 or 1. The service must meet expectations fully. Although these message interactions are tested via unit tests in development, they also need to be tested on the staging system, because this is a closer simulation of production.

Integrations with other parts of the system can also be tested as part of the staging process. Those parts of the system that aren’t microservices, such as standalone databases, network services such as mail servers, external web service endpoints, and others, are simulated or run at small scale. You can then measure the microservice’s behavior with respect to them. Other aspects of the service, such as performance, resource consumption, and security, need to be measured statistically: take samples of behavior, and use them to predict the risk of failure.

Finally, even in production, you must continue to measure the risk of failure. Even before going into production, you can establish manual gates—formal code reviews, penetration testing, user acceptance, and so forth. These may be legally unavoidable (due to laws affecting your industry), but they can still be integrated into the continuous delivery mindset.

Running services can be monitored and sampled. You can use key metrics, especially those relating to message flow rates, to determine service and system health. Chapter 6 has a great deal more to say about this aspect of the microservice architecture.

5.7. Running a microservice system

The tooling to support microservices is developing quickly, and new tools are emerging at a high rate. It isn’t useful to examine in detail something that will soon be out of date, so this chapter focuses on general principles; that way, you can compare and assess tools and select those most suitable for your context. You should expect and prepare to build some tooling yourself. This isn’t a book on deployment in general, so it doesn’t discuss best practices for deploying system elements, such as database clusters, that aren’t microservices. It’s still recommended that these be subject to automation and, if possible, controlled by the same tooling. The focus of this chapter is on the deployment of your own microservices. Your own microservices implement the business logic of the system and are thus subject to a higher degree of change compared to other elements.

5.7.1. Immutability

It’s a core principle of the approach described here that microservice artifacts are immutable. This preserves their ability to act as primitive operations. A microservice artifact can be a container, a virtual machine image, or some other abstraction.[29] The essential characteristics of the artifact are that it can’t be changed internally and that it has only two states: active and inactive.

29

For very large systems, you might even consider an AWS autoscaling group to be your base unit.

The power of immutability is that it excludes side effects from the system. The behavior of the system and microservice instances is more predictable, because you can be sure they aren’t affected by changes you aren’t aware of. An immutable artifact contains everything the microservice needs to run, at fixed versions. You can be absolutely sure that your language-platform version, libraries, and other dependencies are exactly as you expect. Nobody can manually log in to the instance and make unaudited changes. This predictability allows you to calibrate risk estimations more accurately.

Running immutable instances also forces you to treat microservices as disposable. An instance that develops a problem or contains a bug can’t be “fixed”; it can only be deactivated and replaced by a new instance. Matching capacity to load isn’t about rebuilding installations on bigger machines; it’s about running more instances of the artifact. No individual instance is in any way special. This approach is a basic building block for reliable systems on unreliable infrastructure.

The following subsections provide a reference overview of microservice deployment patterns. You’ll need to compare these patterns against the capabilities of the automation tooling you’re using. Unfortunately, you should expect to be disappointed, and you’ll need to augment your tooling to fully achieve the desired benefits of the patterns.

Feel free to treat these sections as a recipe book for cooking up your own patterns rather than a prescription. You can skim the deployment patterns without guilt.[30] To kick things off, figure 5.9 is a reminder of the diagramming conventions. In particular, the number of live instances is shown in braces ({1}) above the name of the microservice, and the version is shown below (1.0.0).

30

In production, I’m fondest of the Progressive Canary pattern.

Figure 5.9. Microservice diagram key

Rollback pattern

This deployment pattern is used to recover from a deployment that has caused one or more failure metrics to exceed their thresholds. It enables you to deploy new service versions where the estimated probability of failure is higher than your thresholds, while maintaining overall operation within thresholds.

To use the Rollback pattern (figure 5.10), apply the activate primitive for a given microservice artifact to the system. Observe failure alerts, and then deactivate the same artifact. Deactivation can be manual or automatic; logs should be preserved. Deactivation should return the system to health. Recovery is expected but may not occur in cases where the offending instance has injected defective messages into the system (for this, see the Kill Switch pattern).

Figure 5.10. Rollback pattern sequence
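
To make the sequence concrete, here’s a minimal sketch of an automated Rollback driver. The activate, deactivate, and failureRate calls are placeholders for whatever primitives your automation tooling exposes; the names, observation window, and sampling interval are illustrative assumptions, not a real tool’s API.

```typescript
// Hypothetical primitives exposed by your automation tooling; the names and
// signatures are illustrative, not a real product's API.
type Artifact = { service: string; version: string };

interface Primitives {
  activate(artifact: Artifact): Promise<void>;
  deactivate(artifact: Artifact): Promise<void>;
  failureRate(service: string): Promise<number>; // e.g. failed messages per second
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Activate the new artifact, watch the failure metric for a fixed window, and
// deactivate the artifact automatically if the threshold is exceeded.
async function rollbackDeploy(
  p: Primitives,
  artifact: Artifact,
  threshold: number,
  observeMs = 10 * 60 * 1000, // observation window
  sampleMs = 15 * 1000        // metric sampling interval
): Promise<boolean> {
  await p.activate(artifact);
  const deadline = Date.now() + observeMs;
  while (Date.now() < deadline) {
    if ((await p.failureRate(artifact.service)) > threshold) {
      await p.deactivate(artifact); // logs are preserved by the tooling
      return false;                 // rolled back
    }
    await sleep(sampleMs);
  }
  return true; // the deployment held within thresholds
}
```

The driver treats rollback as the default outcome: the deployment only sticks if the failure metric stays under the threshold for the whole observation window.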

Homeostasis pattern

This pattern (figure 5.11) lets you maintain desired architectural structure and capacity levels. You implement a declarative definition of your architecture, including rules for increasing capacity under load, by applying activation and deactivation primitives to the system. Simultaneous application of primitives is permitted, although you must take care to implement and record this correctly. Homeostasis can also be implemented by allowing services to issue primitive operations, and defining local rules for doing so (see the Mitosis and Apoptosis patterns).

Figure 5.11. Homeostasis pattern sequence
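
One way to sketch Homeostasis is a declarative list of desired capacity plus a reconciliation loop that nudges the system toward it, one primitive at a time. The System interface below stands in for your automation tooling, and the load thresholds are illustrative assumptions.

```typescript
// A declarative statement of desired capacity, and a reconciliation loop that
// applies activate/deactivate primitives to converge on it. The System
// interface and the load thresholds are assumptions, not a real tool's API.
type Desired = { service: string; version: string; min: number; max: number };

interface System {
  runningInstances(service: string, version: string): Promise<number>;
  load(service: string): Promise<number>; // 0..1 utilization
  activate(service: string, version: string): Promise<void>;
  deactivate(service: string, version: string): Promise<void>;
}

// Run this on a schedule; each pass nudges the system one step toward the
// declared state, staying within the min/max bounds.
async function reconcile(sys: System, desired: Desired[]): Promise<void> {
  for (const d of desired) {
    const running = await sys.runningInstances(d.service, d.version);
    const load = await sys.load(d.service);
    let target = running;
    if (load > 0.8) target = running + 1; // scale up under load
    if (load < 0.2) target = running - 1; // scale down when idle
    target = Math.max(d.min, Math.min(d.max, target));
    if (target > running) await sys.activate(d.service, d.version);
    if (target < running) await sys.deactivate(d.service, d.version);
  }
}
```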

History pattern

The History pattern (figure 5.12) provides diagnostic data to aid your understanding of failing and healthy system behavior. Using this pattern, you maintain an audit trail of the temporal order of primitive operation application—a log of what was activated/deactivated, and when. A complication is that you may allow simultaneous application of sets of primitives in your system.

Figure 5.12. History pattern sequence

The audit history lets you diagnose problems by inspecting the behavior of previous versions of the system—these can be resurrected by applying the primitives to simulated systems. You can also deal with defects that are introduced but not detected immediately by moving backward over the history.

5.7.2. Automation

Microservice systems in production have too many moving parts to be managed manually. This is part of the trade-off of the architecture. You must commit to using tooling to automate your system—and this is a never-ending task. Automation doesn’t cover all activities from day one, nor should it, because you need to allocate most of your development effort to creating business value. Over time, you’ll need to automate more and more.

To determine which activity to automate next, divide your operational tasks into two categories. In the first category, Toil,[31] place those tasks where effort grows at least linearly with the size of the system. To put it another way, from a computational complexity perspective, these are tasks where human effort is at least O(n), where n is the number of microservice types (not instances). For example, configuring log capture for a new microservice type might require manual configuration of the log-collection subsystem. In the second category, Win, place tasks that are less than O(n) in the number of microservice types: for example, adding a new database secondary reader instance to handle projected increased data volumes.

31

This usage of the term originates with the Google Site Reliability Engineering team.

The next task to automate is the most annoying task from the Toil pile, where annoying means “most negatively impacting the business goals.” Don’t forget to include failure risk in your calculation of negative impact.

Automation is also necessary to execute the microservice deployment patterns. Most of the patterns require the application of large numbers of primitive operations over a scheduled period of time, under observation for failure. These are tasks that can’t be performed manually at scale.

Automation tooling is relatively mature and dovetails with the requirements of modern, large-scale enterprise applications. A wide range of tooling options are available, and your decision should be driven by your comfort levels with custom modification or scripting; you’ll need to do some customization to fully execute the microservice deployment patterns described next.[32]

32

Plan to evaluate tools such as Puppet (https://puppet.com), Chef (www.chef.io/chef), Ansible (www.ansible.com), Terraform (www.terraform.io), and AWS CodeDeploy (https://aws.amazon.com/codedeploy).

Canary pattern

New microservices and new versions of existing microservices introduce considerable risk to a production system. It’s unwise to deploy the new instances and immediately allow them to take large amounts of load. Instead, run multiple instances of known-good services, and slowly replace these with new ones.

The first step in the Canary pattern (figure 5.13) is to validate that the new microservice both functions correctly and isn’t destructive. To do so, you activate a single new instance and direct a small amount of message traffic to this instance. Then, watch your metrics to make sure the system behaves as expected. If it doesn’t, apply the Rollback pattern.

Figure 5.13. Canary pattern sequence
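
Here’s a minimal sketch of the routing side of the Canary pattern, assuming your message layer lets you hook instance selection; the chooseInstance function and the 5% share are illustrative, not a specific product’s API.

```typescript
// Send a small share of matching traffic to the single canary instance; the
// rest goes to the known-good instances. The routing hook is an assumption
// about your message layer.
type Instance = { id: string; version: string };

function chooseInstance(
  stable: Instance[],
  canary: Instance,
  canaryShare = 0.05 // roughly 5% of messages
): Instance {
  if (Math.random() < canaryShare) return canary;
  return stable[Math.floor(Math.random() * stable.length)];
}
```

If the canary’s metrics breach a threshold, you apply the Rollback pattern to the new instance; the stable instances continue to carry the load.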

Progressive Canary pattern

This pattern (figure 5.14) lets you reduce the risk of a full update by applying changes progressively in larger and larger tranches. Although Canary can validate the safety of a single new instance, it doesn’t guarantee good behavior at scale, particularly with respect to unintended destructive behavior. With the Progressive Canary pattern, you deploy a progressively larger number of new instances to take progressively more traffic, continuing to validate during the process. This balances the need for full deployment of new instance versions to occur at reasonable speed with the need to manage the risk of the change.

Figure 5.14. Progressive Canary pattern sequence

Primitives are applied concurrently in this pattern. The Rollback pattern can be extended to handle decommissioning of multiple instances if a problem does arise.
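
As a sketch of the tranche schedule, the following again assumes hypothetical activate/deactivate/failure-rate primitives rather than a specific tool’s API; the tranche sizes and threshold are illustrative.

```typescript
// Replace old instances with new ones in progressively larger tranches,
// validating after each tranche. The primitives are assumptions about your
// automation tooling.
interface TranchePrimitives {
  activate(service: string, version: string): Promise<void>;
  deactivate(service: string, version: string): Promise<void>;
  failureRate(service: string): Promise<number>;
}

async function progressiveCanary(
  p: TranchePrimitives,
  service: string,
  oldVersion: string,
  newVersion: string,
  total: number,             // instances to replace in the end
  tranches = [1, 2, 4, 8],   // progressively larger batches
  threshold = 0.01           // acceptable failure rate
): Promise<boolean> {
  let replaced = 0;
  for (const size of tranches) {
    const batch = Math.min(size, total - replaced);
    for (let i = 0; i < batch; i++) {
      await p.activate(service, newVersion);   // bring up a new instance
      await p.deactivate(service, oldVersion); // retire an old one
    }
    replaced += batch;
    if ((await p.failureRate(service)) > threshold) return false; // apply Rollback
    if (replaced >= total) return true;
  }
  return replaced >= total;
}
```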

Bake pattern

The Bake pattern (figure 5.15) reduces the risk of failures that have severe downsides. It’s a variation of Progressive Canary that maintains a full complement of existing instances but also sends a copy of inbound message traffic to the new instances. The output from the new instances is compared with the old to ensure that deviations are below the desired threshold. Output from the new instances is discarded until this criterion is met. The system can continue in this configuration, validating against production traffic, until sufficient time has passed to reach the desired risk level.

Figure 5.15. Bake pattern sequence

This pattern is most useful when the output must meet a strict failure threshold and where failure places the business at risk. Consider using Bake when you’re dealing with access to sensitive data, financial operations, and resource-intensive activities that are difficult to reverse.[33] The pattern does require intelligent load balancing and additional monitoring to implement.

33

The canonical description of this technique is given by GitHub’s Zach Holman in his talk “Move Fast & Break Nothing,” October 2014 (https://zachholman.com/talk/move-fast-break-nothing). It isn’t necessary to fully replicate the entire production stack; you only need to duplicate a sample of production traffic to measure correctness within acceptable levels of risk.
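
Before moving on, here’s a minimal sketch of the Bake pattern’s traffic-duplication step, assuming your message layer can invoke both versions; callOld and callNew are placeholders, and the naive JSON comparison is only illustrative (real systems compare the fields that matter).

```typescript
// Serve the old version's response, send a copy of the message to the new
// version, and record whether the outputs agreed. The callOld/callNew
// functions stand in for your message layer and are assumptions.
type Msg = Record<string, unknown>;

async function bake(
  msg: Msg,
  callOld: (m: Msg) => Promise<Msg>,
  callNew: (m: Msg) => Promise<Msg>,
  record: (matched: boolean) => void
): Promise<Msg> {
  const oldResult = await callOld(msg);
  callNew(msg)
    .then((newResult) =>
      // Naive comparison for illustration; compare the fields you care about.
      record(JSON.stringify(newResult) === JSON.stringify(oldResult))
    )
    .catch(() => record(false)); // a failed call counts as a deviation
  return oldResult; // the new version's output is discarded
}
```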

Merge pattern

Performance is impacted by network latency. As the system grows and load increases, certain message pathways will become bottlenecks. In particular, latency caused by the need to send messages over the network between microservices may become unacceptable. Also, security concerns may arise that require encryption of the message stream, causing further latency.

To counteract this issue, you can trade some of the flexibility of microservices for performance by merging microservices in the critical message path. By using a message abstraction layer and pattern matching, as discussed in earlier chapters, you can do so with minimal code changes. Don’t merge microservices wholesale; try to isolate the message patterns you’re concerned with into a combined microservice. By executing a message pathway within a single process, you remove the network from the equation.

The Merge pattern (figure 5.16) is a good example of the benefit of the microservices-first approach. In the earlier part of an application’s lifecycle, more flexibility is needed, because understanding of the business logic is less solid. Later, you may require optimizations to meet performance goals.

Figure 5.16. Merge pattern sequence
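
To see why the merge needs only minimal code changes, here’s a framework-free sketch of two formerly separate services whose patterns now run in one process. The tiny add/act router is a stand-in for your message abstraction layer, and the role/cmd patterns and tax rate are invented for illustration.

```typescript
// Two formerly separate services whose patterns now run in one process, so
// the hop between them is an in-process call rather than a network round
// trip. The add/act router is a stand-in for your message abstraction layer.
type Message = { role: string; cmd: string; [k: string]: unknown };
type Handler = (msg: Message) => Promise<Record<string, unknown>>;

const handlers: { pattern: Partial<Message>; handler: Handler }[] = [];

function add(pattern: Partial<Message>, handler: Handler): void {
  handlers.push({ pattern, handler });
}

async function act(msg: Message): Promise<Record<string, unknown>> {
  const entry = handlers.find(({ pattern }) =>
    Object.entries(pattern).every(([k, v]) => (msg as Record<string, unknown>)[k] === v)
  );
  if (!entry) throw new Error('no handler for message');
  return entry.handler(msg);
}

// Formerly the tax microservice.
add({ role: 'tax', cmd: 'calculate' }, async (msg) => ({
  tax: (msg.net as number) * 0.23,
}));

// Formerly the checkout microservice; the critical path stays in-process.
add({ role: 'checkout', cmd: 'total' }, async (msg) => {
  const { tax } = await act({ role: 'tax', cmd: 'calculate', net: msg.net });
  return { total: (msg.net as number) + (tax as number) };
});
```

A call such as act({ role: 'checkout', cmd: 'total', net: 100 }) now resolves the whole pathway without touching the network, yet the message patterns, and the option to split the services apart again later, are unchanged.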

Split pattern

Microservices grow over time, as more business logic is added, so you need to add new kinds of services to avoid building technical debt. In the early lifecycle of a system, microservices are small and handle general cases. As time goes by, more special cases are added to the business logic. Rather than handling these cases with more complex internal code and data structures, it’s better to split out special cases into focused microservices. Pattern matching on messages makes this practical and is one of the core benefits of the pattern-matching approach.

The Split pattern (figure 5.17) captures one of the core benefits of the microservice architecture: the ability to handle frequently changing, underspecified requirements. Always look for opportunities to split, and avoid the temptation to use more-familiar language constructs (such as object-oriented design patterns), because they build technical debt over time.

Figure 5.17. Split pattern sequence

5.7.3. Resilience

Chapter 3 discussed some of the common failure modes of microservice systems. A production deployment of microservices needs to be resilient to these failure modes. Although the system can never be entirely safe, you should put mitigations in place. As always, the extent and cost of the mitigation should correspond to the desired level of risk.

In monolithic systems, failure is often dealt with by restarting the instances that fail. This approach is heavy-handed and often not very effective. The microservice architecture offers a finer-grained menu of techniques for handling failure. The abstraction of a messaging layer is helpful, because this layer can be extended to provide automatic safety devices (ASDs). Bear in mind that ASDs aren’t silver bullets, and may themselves cause failure, but they’re still useful for many modes of failure.

Slow downstream

In this failure mode, taking the perspective of a given client microservice instance, responses to outbound messages show latency or throughput outside acceptable levels. The client microservice can use the following dynamic tactics, roughly in order of increasing sophistication:

  • Timeouts— Consider messages failed if a response isn’t delivered within a fixed timeout period. This prevents resource consumption on the client microservice.
  • Adaptive timeouts— Use timeouts, but don’t set them as fixed configuration parameters. Instead, dynamically adjust the timeouts based on observed behavior. As a simplistic example, time out if the response delay is more than three standard deviations from the observed mean response time. Adaptive timeouts reduce the occurrence of false positives when the overall system is slow and avoid delays in failure detection when the system is fast.
  • Circuit breaker— Persistently slow downstream services should be considered effectively dead. Implementation requires the messaging layer to maintain metadata about downstream services. This tactic avoids unnecessary resource consumption and unnecessary degradation of overall performance. It does increase the risk of overloading healthy machines by redirecting too much traffic to them, causing a cascading failure similar to the unintended effects of the ASDs at Three Mile Island.
  • Retries— If failure to execute a task has a cost, and there’s tolerance for delay, it may make more sense to retry a failed message by sending it again. This is an ASD with great potential to go wrong: large volumes of retries are a self-inflicted denial of service (DoS) attack. Use a retry budget: retry only a limited number of times and, if the metadata is available, also cap retries per downstream. Retries should also use a randomized exponential backoff delay before being sent, because spreading retries over time gives the downstream a better chance of recovery (see the sketch after this list).
  • Intelligent round-robin— If the messaging layer is using point-to-point transmission to send messages, then it necessarily has sufficient metadata to implement round-robin load balancing among the downstreams. Simple round-robin keeps a list of downstreams and cycles through them. This ignores differences in load between messages and can lead to individual downstreams becoming overloaded. Random round-robin is found empirically to be little better, probably because the same clustering of load is possible. If the downstream microservices provide backpressure metadata, then the round-robin algorithm can make more-informed choices: it can choose the least loaded downstream, weight downstreams based on known capacity, and restrict downstreams to a subset of the total to avoid a domino effect from a circuit breaker that trips too aggressively.
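
Here’s the retry-budget-plus-backoff sketch referred to in the Retries tactic. The send function stands in for whatever your message layer exposes; the budget and base delay are illustrative defaults.

```typescript
// Retry with a budget and full-jitter exponential backoff. The send function
// is whatever your message layer exposes; the defaults are illustrative.
async function sendWithRetry<T>(
  send: () => Promise<T>,
  budget = 3,         // at most this many retries per message
  baseDelayMs = 100
): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await send();
    } catch (err) {
      if (attempt >= budget) throw err; // budget exhausted; give up
      // Randomized exponential backoff spreads retries over time, giving the
      // downstream a chance to recover.
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
      attempt++;
    }
  }
}
```
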
Upstream overload

This is the other end of the overload scenario: the downstream microservice is getting too many inbound messages. Some of the tactics to apply are as follows:

  • Adaptive throttles— Don’t attempt to complete all work as it comes in. Instead, queue the work to a maximum rate that can be safely handled. This prevents the service from thrashing. Services that are severely resource constrained will spend almost all of their time swapping between tasks rather than working on tasks. On thread-based language platforms, this consumes memory; and on event-driven platforms, this manifests as a single task hogging the CPU and stalling all other tasks. As with timeouts, it’s worth making throttles adaptive to optimize resource consumption.
  • Backpressure— Provide client microservices with metadata describing current load levels. This metadata can be embedded in message responses. The downstream service doesn’t actively handle load but relies on the kindness of its clients. The metadata makes client tactics for slow downstreams more effective.
  • Load shedding— Refuse to execute tasks once a dangerous level of load has been reached. This is a deliberate decision to fail a certain percentage of messages. This tactic gives most messages reasonable latency and fails some outright, rather than allowing many messages to have high latency with sporadic failure. Appropriate metadata should be returned to client services so they don’t interpret load shedding incorrectly and trigger a circuit breaker. The selection of tasks to drop, queue, or execute immediately can be determined algorithmically and is context dependent. Nonetheless, even a simple load shedder will prevent many kinds of catastrophic collapse (a minimal load shedder is sketched below).

In addition to these dynamic tactics, upstream overload can be reduced on a longer time frame by applying the Merge deployment pattern.
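
Here’s the minimal load shedder mentioned above. The concurrency limit and the reply metadata shape are assumptions; the point is that shed messages fail fast, with an explicit reason the client can act on.

```typescript
// A simple load shedder: refuse work beyond a safe concurrency limit, and say
// so explicitly, so clients don't mistake shedding for a dead downstream.
// The limit and the reply shape are assumptions.
type Reply = Record<string, unknown>;

function makeShedder(maxInFlight: number) {
  let inFlight = 0;
  return async function handle(work: () => Promise<Reply>): Promise<Reply> {
    if (inFlight >= maxInFlight) {
      // Deliberately fail this message, with metadata for the client.
      return { ok: false, shed: true, retryAfterMs: 1000 };
    }
    inFlight++;
    try {
      return await work();
    } finally {
      inFlight--;
    }
  };
}
```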

Lost actions

To address this failure mode, apply the Progressive Canary deployment pattern, measuring message-flow rates to ensure correctness. Chapter 6 discusses measurement in more detail.

Poison messages

In this failure mode, a microservice generates a poisonous message that triggers a defect in other microservices, causing some level of failure. If the message is continuously retried against different downstream services, they all suffer failures. You can respond in one of these ways:

  • Drop duplicates— Downstream microservices should track message-correlation identifiers and keep a short-term record of recently seen inbound messages. Duplicates should be ignored.
  • Validation— Trade the flexibility of schema-free messages for stricter validation of inbound message data. This has a less detrimental effect later in the project when the pace of requirement change has slowed.

Consider building a dead-letter service. Problematic messages are forwarded to this service for storage and later diagnosis. This also allows you to monitor message health across the system.
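
A minimal sketch combining the duplicate-dropping tactic and a dead-letter hand-off follows. The correlationId field and the deadLetter call are assumptions about your message layer, and the expiry window is illustrative.

```typescript
// Drop recently seen correlation IDs, and forward messages that fail
// processing to a dead-letter service for later diagnosis.
const seen = new Map<string, number>(); // correlation ID -> last-seen timestamp
const SEEN_TTL_MS = 60_000;             // "short-term" record of inbound messages

function isDuplicate(correlationId: string): boolean {
  const now = Date.now();
  for (const [id, ts] of seen) if (now - ts > SEEN_TTL_MS) seen.delete(id); // expire old entries
  if (seen.has(correlationId)) return true;
  seen.set(correlationId, now);
  return false;
}

async function handleInbound(
  msg: { correlationId: string; [k: string]: unknown },
  process: (m: unknown) => Promise<void>,
  deadLetter: (m: unknown, err: unknown) => Promise<void>
): Promise<void> {
  if (isDuplicate(msg.correlationId)) return; // ignore duplicates
  try {
    await process(msg);
  } catch (err) {
    await deadLetter(msg, err); // park the poison message for diagnosis
  }
}
```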

Guaranteed delivery

Message delivery may fail in many ways. Messages may not arrive or may arrive multiple times. Dropping duplicates will help within a service. Duplicated messages sent to multiple services are more difficult to mitigate. If the risk associated with such events is too high, allocate extra development effort to implement idempotent message interactions.[34]

34

Be careful not to overthink your system in the early days of a project. It’s often better to accept the risk of data corruption in order to get to market sooner. Be ethical, and make this decision openly with your stakeholders. Chapter 8 has more about making such decisions.

Emergent behavior

A microservice system has many moving parts. Message behavior may have unintended consequences, such as triggering additional workflows. You can use correlation identifiers for after-the-fact diagnosis, but not to actively prevent side effects:

  • Time to live— Use a decrementing counter that’s reduced each time an inbound message triggers the generation of outbound messages. This prevents unbounded side effects from proceeding without any checks. In particular, it stops infinite message loops. It won’t prevent all side effects, but it will limit their extent. You’ll need to determine the appropriate value of the counter in the context of your own system, but you should prefer low values. Microservice systems should be shallow, not deep (see the sketch following this list).
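
Here’s the time-to-live sketch referred to above; the ttl field and the default depth are assumptions about your message metadata.

```typescript
// Decrement a time-to-live counter on every outbound message generated from
// an inbound one, and stop emitting when it reaches zero. The ttl field and
// the default depth are assumptions about your message metadata.
type TtlMessage = { ttl?: number; [k: string]: unknown };

const DEFAULT_TTL = 4; // keep message chains shallow

function nextHop(inbound: TtlMessage, outbound: TtlMessage): TtlMessage | null {
  const ttl = (inbound.ttl ?? DEFAULT_TTL) - 1;
  if (ttl <= 0) return null; // the chain has gone too deep; don't emit
  return { ...outbound, ttl };
}
```
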
Catastrophic collapse

Some emergent behavior can be exceptionally pathological, placing the system into an unrecoverable state, even though the original messages are no longer present. In this failure mode, even with the Homeostasis pattern in place, service restarts can’t bring the system back to health.

For example, a defect may crash a large number of services in rapid succession. New services are started as replacements, but they have empty caches and are thus unable to handle current load levels. These new services crash and are themselves replaced. The system can’t establish enough capacity to return to normal. This is known as the thundering herd problem. Here are some ways to address it:

  • Static responses— Use low-resource emergency microservices that return hardcoded responses to temporarily take load.
  • Kill switch— Establish a mechanism to selectively stop large subsets of services. This gives you the ability to quarantine the problem. You can then restart into a known-good state.

In addition to using these dynamic tactics, you can prepare for disaster by deliberately testing individual services with high load to determine their failure points. Software systems tend to fail quickly rather than gradually, so you need to establish safe limits in advance.

The following sections describe microservice deployment patterns that provide resilience.

Apoptosis pattern

The Apoptosis[35] pattern removes malfunctioning services quickly, thus reducing capacity organically. Microservices can perform self-diagnosis and shut themselves down if their health is unsatisfactory. For example, all message tasks may be failing because local storage is full; the service can maintain internal statistics to calculate health. This approach also enables a graceful shutdown by responding to messages with metadata indicating a failure during the shutdown, rather than triggering timeouts.

35

Living cells commit suicide if they become too damaged: apoptosis. This prevents cell damage from accumulating and causing cancers.

Apoptosis (figure 5.18) is also useful for matching capacity with load. It’s costly to maintain active resources far in excess of levels necessary to meet current load. Services can choose to self-terminate, using a probabilistic algorithm to avoid mass shutdowns. Load is redistributed over the remaining services.

Figure 5.18. Apoptosis pattern sequence

Mitosis pattern

The Mitosis[36] pattern responds to increased load organically, without centralized control. Individual microservices have the most accurate measure of their own load levels. You can trigger the launching of new instances if local load levels are too high; this should be done using a probabilistic approach, to avoid large numbers of simultaneous launches. The newly launched service will take some of the load, bringing levels back to acceptable ranges.

36

Living cells replicate by splitting in two. Mitosis is the name of this process.

Mitosis (figure 5.19) and Apoptosis should be used with care and with built-in limits. You don’t want unbounded growth or a complete shutdown. Launch and shutdown should occur via primitive operations executed by the infrastructure tooling, not by the microservices.

Figure 5.19. Mitosis pattern sequence
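
As a sketch, the local, probabilistic decision that drives both Mitosis and Apoptosis can be as small as the function below. The service only requests a launch or a shutdown; the requests are carried out, and bounded, by the infrastructure tooling. The thresholds and request functions are assumptions.

```typescript
// A local, probabilistic scaling decision. The service only *requests* a
// launch or a shutdown; the primitives are executed by the infrastructure
// tooling, which also enforces global minimum and maximum instance counts.
function scaleDecision(
  load: number,                 // 0..1 local utilization
  requestMitosis: () => void,   // ask the tooling to launch a sibling
  requestApoptosis: () => void  // ask the tooling to retire this instance
): void {
  // The further past the threshold, the more likely a single instance acts;
  // randomness avoids mass simultaneous launches or shutdowns.
  if (load > 0.8 && Math.random() < (load - 0.8) / 0.2) requestMitosis();
  else if (load < 0.2 && Math.random() < (0.2 - load) / 0.2) requestApoptosis();
}
```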

Kill Switch pattern

With this pattern (figure 5.20), you disable large parts of the system to limit damage. Microservice systems are complex, just like the Three Mile Island reactor. Failure events at all scales are to be expected. Empirically, the frequency of these events follows a power law. Eventually, an event with the potential for significant damage will occur.

Figure 5.20. Kill Switch pattern sequence

To limit the damage, rapid action is required. It’s impossible to understand the event during its occurrence, so the safest course of action is to scram the system. You should be able to shut down large parts of the system using secondary communication links to each microservice. As the event progresses, you may need to progressively shut down more and more of the system in order to eventually contain the damage.

5.7.4. Validation

Continuous validation of the production system is the key practice that makes microservices successful. No other activity provides as much risk reduction and value. It’s the only way to run a CD pipeline responsibly.

What do you measure in production? CPU load levels? Memory usage? These are useful but not essential. Far more important is validation that the system is behaving as desired. Messages correspond to business activities or parts of business activities, so you should focus on the behavior of messages. Message-flow rates tell you a great deal about the health of the system. On their own, they’re less useful than you may think, because they fluctuate with the time of day and with seasonality; it’s more useful to compare message-flow rates with each other.

A given business process is encoded by an expanding set of messages generated from an initial triggering message. Thus, message-flow rates are related to each other by ratios. For example, for every message of a certain kind, you expect to see two messages of another kind. These ratios don’t change, no matter what the load level of the system or the number of services present. They’re invariant.

Invariants are the primary indicator of health. When you deploy a new version of a microservice using the Canary pattern, you check that the invariants are within expected bounds. If the new version contains a defect, the message-flow ratios will change, because some messages won’t be generated. This is an immediate indicator of failure. Invariants, after all, can’t vary. We’ll come back to this topic in chapter 6 and examine an example in chapter 9.
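
As a sketch, an invariant check reduces to comparing an observed ratio of message-flow rates against the expected ratio, within a tolerance. The counts come from your monitoring system; the 2:1 ratio and 5% tolerance are invented for illustration.

```typescript
// Check a message-flow invariant: for every trigger message, expect a fixed
// ratio of derived messages. The example ratio and tolerance are illustrative.
function invariantHealthy(
  triggerCount: number,   // e.g. observed checkout/start messages
  derivedCount: number,   // e.g. observed fulfillment/reserve messages
  expectedRatio = 2,      // derived messages per trigger message
  tolerance = 0.05        // allowed relative deviation
): boolean {
  if (triggerCount === 0) return true; // nothing to judge yet
  const observed = derivedCount / triggerCount;
  return Math.abs(observed - expectedRatio) / expectedRatio <= tolerance;
}
```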

The following sections present some applicable microservice deployment patterns.

Version Update pattern

This pattern (figure 5.21) lets you safely update a set of communicating microservices. Suppose that microservices A and B communicate using messages of kind x. New business requirements introduce the need for messages of kind y between the services. It’s unwise to attempt a simultaneous update of both; it’s preferable to use the Progressive Canary deployment pattern to make the change safely.

Figure 5.21. Version Update pattern sequence

First, you update listening service B so that it can recognize the new message, y. No other services generate this message in production yet, but you can validate that the new version of B doesn’t cause damage. Once the new B is in place, you update A, which emits y messages.

This multistage update (composed of Progressive Canaries for each stage) can be used for many scenarios where message interactions need to change. You can use it when the internal data of the messages changes. (B, in this case, must retain the ability to handle old messages until the change is complete.) You can also use it to inject a third service between two existing services, by applying the pattern first to one side of the interaction and then to the other. This is a common way to introduce a cache, such as the one you saw in chapter 1.

Chaos pattern

You can ensure that a system is resistant to failure by constantly failing at a low rate. Services can develop fragile dependencies on other services, despite your best intentions. When dependencies fail, even when that failure is below the acceptable threshold, the cumulative effect can cause threshold failures in client services.

To prevent creeping fragility, deliberately fail your services on a continuous basis, in production. Calibrate the failure to be well below the failure threshold for the business, so it doesn’t have a significant impact on business outcomes. This is effectively a form of insurance: you take small, frequent losses to avoid large, infrequent losses that are fatal.

The most famous example of the Chaos pattern is the Netflix Chaos Monkey, which randomly shuts down services in the Netflix infrastructure. Another example is Google Game Days, where large-scale production failures are deliberately triggered to test failover capabilities.

5.7.5. Discovery

Pattern matching and transport independence give you decoupled services. When microservice A knows that microservice B will receive its messages, then A is coupled to B. Unfortunately, message transport requires knowledge of the location of B; otherwise, messages can’t be delivered. Transport independence hides the mechanism of transportation from A, and pattern matching hides the identity of B. Identity is coupling.

The messaging abstraction layer needs to know the location of B, even as it hides this information from A. A (or, at least, the message layer in A) needs to discover the location of B (in reality, the set of locations of all the instances of B). This is a primary infrastructural challenge in microservice systems. Let’s examine the common solutions:

  • Embedded configuration— Hardcode service locations as part of the immutable artifact.
  • Intelligent load balancing— Direct all message traffic through load balancers that know where to find the services.
  • Service registries— Services register their location with a central registry, and other services look them up in the registry.
  • DNS— Use the DNS protocol to resolve the location of a service.
  • Message bus— Use a message bus to separate publishers from subscribers.
  • Gossip— Use a peer-to-peer membership gossip protocol to share service locations.

No solution is perfect, and they all involve trade-offs, as listed in table 5.1.

Table 5.1. Service-discovery methods

Discovery: Embedded configuration
Advantages: Easy implementation. Works (mostly) for small static systems.
Disadvantages: Doesn’t scale, because large systems are under continuous change. Very strong identity concept: raw network location.

Discovery: Intelligent load balancing
Advantages: Scalable with proven production-quality tooling. Examples: NGINX and Netflix’s Ribbon.
Disadvantages: Non-microservice network element that requires separate management. Load balancers force limited transport options and must still discover service locations themselves using one of the other discovery mechanisms. Retains identity concept: request URL.

Discovery: Registry
Advantages: Scalable with proven production-quality tooling. Examples: Consul, ZooKeeper, and etcd.
Disadvantages: Non-microservice network element. High dependency on the chosen solution, because there are no common standards. Strong identity concept: service name key.

Discovery: DNS
Advantages: Unlimited scale and proven production-quality tooling. Well understood. Can be used by other mechanisms to weaken identity by replacing raw network locations.
Disadvantages: Non-microservice network element. Management overhead. Weak identity concept: hostname.

Discovery: Message bus
Advantages: Scalable with proven production-quality tooling. Examples: RabbitMQ, Kafka, and NServiceBus.
Disadvantages: Non-microservice network element. Management overhead. Weak identity concept: topic name.

Discovery: Gossip
Advantages: No concept of identity! Doesn’t require additional network elements.
Disadvantages: Early adopter stage, but shown to work at scale.[37] Message layer must have additional intelligence to handle load balancing. Rapidly evolving implementations.

37

The SWIM algorithm has found success at Uber. See “How Ringpop from Uber Engineering Helps Distribute Your Application” by Lucie Lozinski, February 4, 2016, https://eng.uber.com/intro-to-ringpop0.

5.7.6. Configuration

How do you configure your microservices? Configuration is one of the primary causes of deployment failure. Does configuration live with the service, immutably packaged into the artifact? Or does configuration live on the network, able to adapt dynamically to live conditions, and providing an additional way to control services?

If configuration is packaged with the service, then configuration changes aren’t different from code changes and must be pushed through the CD pipeline in the same manner. Although this does provide better risk management, it also means you may suffer unacceptable delays when you need to make changes to configuration. You also add additional entries to the artifact store and audit logs for every configuration change, which can clutter these databases and make them less useful. Finally, some network components, such as intelligent load balancers, still need dynamic configuration if they’re to be useful, so you can’t place all configuration in artifacts.

On the other hand, network configuration removes your ability to reproduce the system deterministically or to benefit fully from the safety of immutability. The same artifact deployed tomorrow could fail even though it worked today. You need to define separate change-control processes and controls for configuration, because you can’t reuse the artifact pipeline for this purpose. You’ll need to deploy network services and infrastructure to store and serve configuration. Even if most configuration is dynamic, you still have to bake in at least some configuration—in particular, the network location of the configuration store! When you look closely, you find that many services have large amounts of potential configuration arising from third-party libraries included in them. You need to decide to what extent you’ll expose this via the dynamic-configuration store. You’re unlikely to find much value in exposing all of it. The most practical option is to embed low-level configuration into your deployment artifacts.

You’ll end up with a hybrid solution, because neither approach provides a total solution. The immutable-packaging approach has the advantage of reusing the delivery pipeline as a control mechanism and offering more predictable state. Placing most of your configuration into immutable artifacts is a reasonable trade-off. Nonetheless, you should plan for the provision and management of dynamic configuration.
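
A minimal sketch of such a hybrid: values baked into the artifact act as defaults, a small set of keys can be overridden by a configuration service, and the location of that service is itself baked in. The URL, the keys, and the endpoint shape are all assumptions, and a runtime with a global fetch (such as recent Node versions) is assumed.

```typescript
// A hybrid configuration lookup: baked-in values are the defaults, and a
// configuration service (whose address is itself baked in) may override them.
const baked = {
  configServiceUrl: 'http://config.internal:9000', // must be baked in
  timeoutMs: 500,
  greetingMessage: 'hello',
};

type ConfigKey = keyof typeof baked;

async function getConfig(key: ConfigKey): Promise<string | number> {
  try {
    const res = await fetch(`${baked.configServiceUrl}/v1/${String(key)}`);
    if (res.ok) return (await res.json()).value; // dynamic override wins
  } catch {
    // Configuration service unreachable: fall back to the immutable default.
  }
  return baked[key];
}
```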

There are two dangerous anti-patterns to avoid when it comes to configuration. Using microservices doesn’t protect you from them, so remain vigilant:

  • Automation workarounds— Configuration can be used to code around limitations of your automation tooling: for example, using feature flags rather than generating new artifacts. If you do too much of this, you’ll create an uncontrolled secondary command structure that damages the properties of the system that make immutability so powerful.
  • Turing’s revenge— Configuration formats tend to be extended with programming constructs over time, mostly as conveniences to avoid repetition.[38] Now, you have a new, unasked-for programming language in your system that has no formal grammar, undefined behavior, and no debugging tools. Good luck!

    38

    Initially declarative domain-specific languages, such as configuration formats, tend to accumulate programmatic features over time. It’s surprisingly easy to achieve Turing completeness with a limited set of operations.

5.7.7. Security

The microservice architecture doesn’t offer any inherent security benefits and can introduce new attack vectors if you aren’t careful. In particular, there’s a common temptation to share microservice messages with the outside world. This is dangerous, because it exposes every microservice as an attack surface.

There must be an absolute separation: you need a demilitarized zone (DMZ) between the internal world of microservice messages and the outside world of third-party clients. The DMZ must translate between the two. In practice, this means a microservice system should expose traditional integration points, such as REST APIs, and then convert requests to these APIs into messages. This allows for strict sanitization of input.

Internally, you can’t ignore the fact that microservices communicate over a network, and networks represent an opportunity for attack. Your microservices should live in their own private networks with well-defined ingress and egress routes. The rest of the system uses these routes to interact with the microservice system as a whole. The specific microservices to which messages are routed aren’t exposed.

These precautions may still be insufficient, and you need to consider the case where an attacker has some level of access to the microservice network. You can apply the security principle of defense in depth to strengthen your security in layers. There’s always a trade-off between stronger security and operational impact.

Let’s build up a few layers. Microservices can be given a right of refusal, and they can be made more pedantic in the messages they accept. Highly paranoid services will lead to higher error rates but can delay attackers and make attacks more expensive. For example, you can limit the amount of data a service will return for any one request. This approach means custom work for each service.

Communication between services can require shared secrets and, as a further layer, signed messages. This protects against messages injected into the network by an attacker. The distribution and cycling of the secrets introduces operational complexity. The signing of messages requires key distribution and introduces latency.
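
As a sketch of the signing layer, an HMAC over the message body with a shared secret is often enough. This uses Node’s built-in crypto module; secret distribution and rotation are the hard parts and are deliberately not shown.

```typescript
// Sign outbound message bodies with a shared secret and verify them on
// receipt, using Node's built-in crypto module.
import { createHmac, timingSafeEqual } from 'node:crypto';

function sign(body: string, secret: string): string {
  return createHmac('sha256', secret).update(body).digest('hex');
}

function verify(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(body, secret), 'hex');
  const given = Buffer.from(signature, 'hex');
  // timingSafeEqual avoids leaking information through comparison timing.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```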

If your data is sensitive, you can encrypt all communication between microservices. This also introduces latency and management overhead and isn’t to be undertaken lightly. Consider using the Merge pattern for extremely sensitive data flows, avoiding the network as much as possible.

You need secure storage and management of secrets and encryption keys in order for these layers to be effective. There’s no point in encrypting messages if the keys are easily accessible within the network—but your microservices must be able to access the secrets and keys. To solve this problem, you need to introduce another network element: a key-management service that provides secure storage, access control, and audit capabilities.[39]

39

Some examples are HashiCorp Vault (www.vaultproject.io), the AWS Key Management Service (KMS), and, if money is no object, hardware security modules (HSMs).

5.7.8. Staging

The staging system is the control mechanism for the CD pipeline. It encompasses traditional elements, such as a build server for CI. It can also consist of multiple systems that test various aspects of the system, such as performance.

The staging system can also be used to provide manual gates to the delivery pipeline. These are often unavoidable, either politically or legally. Over time, the effectiveness of CD in managing risk and delivering business value quickly can be used to create sufficient organizational confidence to relax overly ceremonial manual sign-offs.

The staging system provides a self-service mechanism for development teams to push updates all the way to production. Empowering teams to do this is a critical component in the success of CD. Chapter 7 discusses this human factor.

Staging should collect statistics to measure the velocity and quality of code delivery over time. It’s important to know how long it takes, on average, to move code from concept to production for a given risk level, because this tells you how efficient your CD pipeline is.

The staging system has the most variance between projects and organizations. The level of testing, the number of staging steps, and the mechanism of artifact generation are all highly context-specific. As you increase the use of microservices and CD in your organization, avoid being too prescriptive in your definition of the staging function; you must allow teams to adapt to their own circumstances.

5.7.9. Development

The development environment needed for microservices should enable developers to focus on a small set of services at a time—often, a single service. The message-abstraction layer comes into its own here, because it makes it easy to mock the behavior of other services.[40] Instead of having to implement a complex object hierarchy, microservice mocking only requires implementing sample message flows. This makes it possible to unit-test microservices in complete isolation from the rest of the system.

40

For a practical example, see the code in chapter 9.

Microservices can be specified as a relation between inbound and outbound messages. This allows you to focus on a small part of the system. It also enables more-efficient parallel work, because messages from other microservices (which may not yet exist) can easily be mocked.

Isolation isn’t always possible or appropriate. Developers often need to run small subsets of the system locally, and tooling is needed to make this practical. It isn’t advisable for development to become dependent on running a full replica of the production system. As production grows to hundreds of different services and beyond, it becomes extremely resource intensive to run services locally, and, ultimately, doing so becomes impossible.

If you’re running only a subset of the system, how do you ensure that appropriate messages are provided for the other parts of the system? One common anti-pattern is to use the build or staging system to do this. You end up working against shared resources that have extremely nondeterministic state. This is the same anti-pattern as having a shared development database.

Each developer should provide a set of mock messages for their service. Where do these mock messages live? At one extreme, you can place all mock-message flows in a common mocking service. All developers commit code to this service, but conflicts are rare because work isn’t likely to overlap. At the other extreme, you can provide a mock service along with each service implementation. The mock service is an extremely simple service that returns hardcoded responses.

The practical solution for most teams is somewhere in the middle. Start with a single universal mocking service, and apply the Split pattern whenever it becomes too unwieldy. Sets of services with a common focus will tend to get their own mocking service. The development environment is typically a small set of actual services, along with one or two mocking services. This minimizes the number of service processes needed on a developer machine.

The mock messages are defined by the developers building a given microservice. This has an unfortunate side effect: the developers who own a service focus on its expected behavior, but other developers will use the service in unexpected ways, so the mocking will be incomplete. If you allow other developers to add mock messages to services they don’t own, then the mocks will quickly diverge from reality. The solution is to add captured messages to the list of sample messages: capture sample message flows from the production or staging logs, and add them to the mock service. This can be done manually for even medium-sized systems.
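
A sketch of such a mock service follows: hardcoded responses keyed by message pattern, to which captured messages can be appended over time. The role/cmd pattern shape and the sample responses are invented for illustration, and the lookup is framework-free rather than any particular message layer’s API.

```typescript
// An extremely simple mock service: hardcoded responses keyed by message
// pattern, to which captured production or staging messages can be appended.
type MockMsg = Record<string, unknown>;

const mocks: { pattern: MockMsg; response: MockMsg }[] = [
  { pattern: { role: 'user', cmd: 'get' },
    response: { ok: true, user: { id: 'u1', name: 'Alice' } } },
  { pattern: { role: 'payment', cmd: 'charge' },
    response: { ok: true, transaction: 'tx-mock-1' } },
];

function mockAct(msg: MockMsg): MockMsg {
  const hit = mocks.find(({ pattern }) =>
    Object.entries(pattern).every(([k, v]) => msg[k] === v)
  );
  return hit ? hit.response : { ok: false, error: 'no mock for message' };
}
```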

Beware the distributed monolith!

How do you know you’re building a distributed monolith? If you need to run all, or most, of your services to get any development work done. If you can’t write a microservice without needing all the other microservices running, then you have a problem.

It’s easy to end up needing to run a large cloud server instance for every developer—in which case development will slow to a crawl. This is why you must invest in a messaging abstraction layer and avoid the mini-web-servers anti-pattern.

You’ll need to think carefully about the mocking strategy you’ll use in your project. It must allow your developers to build with chosen subsets of the system.

5.8. Summary

  • Failure is inevitable and must be accepted. Starting from that perspective, you can work to distribute failure more evenly over time and avoid high-impact catastrophes.
  • Traditional approaches to software quality are predicated on a false belief in perfectionism. Enterprise software systems aren’t desktop calculators and won’t give the correct answer 100% of the time. The closer you want to get to 100%, the more expensive the system becomes.
  • The risk of failure is much higher than generally believed. Simple mathematical modeling of risk probabilities in component-based systems (such as enterprise software) brings this reality starkly into focus.
  • Microservices provide an opportunity to measure risk more accurately. This enables you to define acceptable error rates that your system must meet.
  • By packaging microservices into immutable units of deployment, you can define a set of deployment patterns that mitigate risk and can be automated for efficient management of your production systems.
  • The accurate measurement of risk enables the construction of a continuous delivery pipeline that enables developers to push changes to microservices to production at high velocity and high frequency, while maintaining acceptable risk levels.