Chapter 6. Measurement

This chapter covers

  • Monitoring monoliths versus microservices
  • Monitoring systems with many small parts
  • Using scatterplots to visualize system behavior
  • Measuring messages and services
  • Using invariants to validate system health

Microservice systems need to be monitored. More than that, they need to be measured. The traditional approach to monitoring, where you collect time-series metrics such as CPU load and query response times, isn’t as useful when you have a system of many small moving parts. There are too many microservices to think about each one individually.

Instead, we must widen the concept of system observation. You want to measure the system, rather than monitor it. A microservice system grows organically as business needs change, and taking a measurement-oriented approach allows you to discover and understand the system as it is, rather than how it was designed. Such measurements aren’t limited to the dimension of time. You also want to measure the ever-changing network of relationships within the system.

As always, you need to ask where the value is. What purpose does measurement serve? You want to do three things:

  • Validate the business requirements— You want a set of measurements that demonstrate progress toward your business goals.
  • Verify and understand the technical functioning of the system— You want to make sure things are working as they should and will continue to do so.
  • Manage risk so you can move fast— You want to be able to make rapid changes to the system without breaking it.

We’ll use the microblogging system from chapter 1 to demonstrate how to apply and visualize measurements.[1]

1

It’s worth returning to chapter 1 and reviewing the full microservice diagram (figure 1.5) for the microblogging system. Only the relevant subsections of that diagram will be shown in this chapter.

6.1. The limits of traditional monitoring

The monitoring typically used for monolithic systems is more properly called telemetry. It’s focused on metrics per server. The metrics are mostly time series, with a few capacity checks thrown in. You measure the CPU and memory loads on your machines (even if virtual), network traffic, and disk space. You also measure response times on web service endpoints, and you log slow queries.

When you’re load balancing over a small number of application servers, each using one primary database and a few secondary databases for reading, these metrics are good enough. When you deploy a new version of the monolith, it’s pretty obvious when there’s a problem, because response times and error rates crash and spike. You can easily connect each chart with the system element responsible for the problem. Diagnosis is simple because there are only a few parts.

It’s the database index

When an application that was working fine yesterday suddenly starts to misbehave, it’s almost certainly because you need to index a database column. This is the first thing to look for and the first thing to eliminate as a cause.

An application searches many database columns as part of its core logic. Often these are primary keys, which are always indexed, so increasing numbers of records won’t cause issues. You also explicitly index columns that you know will be used in queries.

But it’s easy to miss some columns that have a critical impact on performance. Often it becomes clear only after you’ve been running in production for a while, and accumulating data, that there’s a dependency on certain columns. The database reaches a tipping point, and performance suddenly declines.

Unfortunately, this isn’t a problem microservices can solve. The same forces and issues apply. When the behavior of a microservice with data responsibility declines, check your indexes!

6.1.1. Classical configurations

In the microblogging system from chapter 1, a core feature is the display of the user’s timeline. The timeline for a user is all the entries the user should see from all the people they follow. The responsiveness of the timeline is vital to the correct functioning of the system. Any increase in latency due to an issue in the system is something you want to know about.

For a moment, imagine the microblogging system as a monolith. Perhaps you have eight application servers running all the functionality. Then, you collect response times for the timeline query and store them as time series data. For each query, you store the time it occurred and how long it took to return a result. You do this for all eight application servers and use a time series database and an associated monitoring tool[2] to generate charts showing response times over time. You can show the average response times of each server. If a server is having issues, you should be able to pick this up from the charts, because you can see the change over time. In the charts in figure 6.1, you can see that the average response time of server A shows a problem.[3]

2

Perhaps a commercial solution such as New Relic (https://newrelic.com) or an open source solution such as Graphite (https://graphiteapp.org).

3

The charts in this chapter were generated using Python data-science tools, in particular seaborn (http://seaborn.pydata.org). The data is simulated to be clear and to help make pedagogical points. Your real data will be messier.

Figure 6.1. Classical response-time charts

Time series data

Time series data is shown as the change in a measure over time. The measure might be the number of inbound requests or the response time of those requests. You’re presented with a nice line chart that has time on the horizontal axis and the value on the vertical axis.

It’s important to understand what you see. Each point on the chart is not a specific event: it’s a summary of a set of events over a given time period. To draw the chart on a per-second basis, all the events in each second are averaged, and the average value is used to draw the chart.

This isn’t the only way to do it. You can use other summary statistics, such as the median (the halfway point), or a rolling average that averages over the last 30 seconds, say.

The underlying data is a set of values at specific time points. Usually, there’s too much data to store or send over the network. So, most analytics solutions aggregate the data before sending it from the client to a time series database. The time series database then summarizes older data to reduce storage needs.

Time series charts are a useful tool, but you should bear in mind that quite a bit of magic is going on behind the scenes.
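To make that aggregation concrete, here's a minimal Python sketch (the event data and the one-second bucketing are illustrative assumptions) of the kind of client-side summarization an analytics agent might perform before shipping data to a time series database:

```python
from collections import defaultdict

# Raw events: (timestamp_in_seconds, response_time_ms).
# In a real system these would stream in from your services.
events = [
    (1000.2, 120), (1000.7, 95), (1000.9, 310),
    (1001.1, 88),  (1001.5, 102),
]

buckets = defaultdict(list)
for ts, response_ms in events:
    buckets[int(ts)].append(response_ms)   # group events into 1-second buckets

# One summary point per second: the mean of all events in that second.
per_second_mean = {
    second: sum(values) / len(values) for second, values in sorted(buckets.items())
}
print(per_second_mean)   # {1000: 175.0, 1001: 95.0}
```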

Now, consider this approach from the perspective of a production microservice system. Many types of microservices are collaborating to produce the timeline, and there are many instances of each microservice type. There can easily be hundreds of response-time charts to review. This isn’t feasible. The traditional time series approach to system measurement doesn’t give you the same insights when it comes to microservices. You need to adopt a measurement approach that can handle large numbers of independent elements.

Why not just use alerts?

The discussion in this chapter is focused on diagnosing issues and understanding the structure of a microservice system. Once you understand an underlying issue and are happy that the system in production matches your design, you can monitor important measures directly.

You can use these measures to define and generate alerts when thresholds are crossed. You can use these alerts to scale the system up and down and send pages when there are critical failures.

The challenge with alerts is to get the sensitivity right. If there are too many failure alerts that aren’t really critical, whoever’s on call that week will start ignoring them. Calibrating scale-up and scale-down is often achieved by (expensive!) trial and error.

On top of all that, by using microservices, you place yourself in a situation where many people are deploying to production (that’s what you want!), and the structure and relationships between system elements are in constant flux. The ability to understand the system as it is becomes far more important as a result. This chapter takes for granted that you apply traditional monitoring techniques, and focuses instead on diagnostic interpretation to understand the complexity inherent in a network of many intercommunicating microservices.

6.1.2. The problem with averages

There are some improvements to be made to the basic time series charts. We won’t abandon time series charts entirely, because they’re useful when used appropriately, but charting the average response time isn’t ideal. An average, also known as the mean, is a summary statistic, which by definition hides information. If response times were distributed evenly and tightly around the average, then it would be a good indication of what you care about—the user’s experience of performance. The problem is that response times have a skewed distribution.[4] They don’t keep close to the average. Some response times are much higher, and some users may experience very low performance despite the average response time appearing to be acceptable. If you sort the response times into buckets of 50 ms and then chart the number of responses in each bucket, you end up with the histogram chart shown in figure 6.2.[5]

4

The average, or mean, is a great summary of numbers that come from the normal distribution. The normal distribution is a mathematical model of randomness that assumes measured values will be close to, and balanced around, some central “true” value.

5

A histogram shows how many items occur in each category of interest. You can construct the categories from numeric ranges to organize the response-time data. This lets you see which response times are more prevalent than others.

Figure 6.2. Histogram chart showing response times

Suppose that, in an attempt to improve performance, you decide to add a caching mechanism. The average comes down, and you’re delighted. But customers still keep complaining about performance. Why? The cache has made about half your requests much faster, but the other requests perform as before, and you still have a small number of requests that are really slow. The average now sits in the middle, as shown in figure 6.3, and doesn’t describe the experience of most users: only a small set of users experience “average” performance.

Figure 6.3. Histogram chart showing response times, with caching causing two peaks

There are other statistics to consider. You want to better characterize the experience of most users so you can know that system performance is acceptable. The median is the value that divides the data in half. If you take the median response time, you know that half the users experienced a faster response time and half experienced a slower response time. This is more meaningful than the average, but the median still doesn’t tell you that some users see very bad performance. You need to know how many people are having a bad experience. Some always will, but you need to know how many so that you can set an acceptable error rate.[6]

6

It doesn’t make sense to build a system that has perfect performance for all users. It’s possible to build such a system, but the cost doesn’t justify the business benefit. There’s always an acceptable level of failure that balances cost with business objectives. For more on this principle, see chapter 8.

6.1.3. Using percentiles

One useful solution to this problem is to start from the perspective of business value. What percentage of users can experience poor performance without impacting the business severely? It’s only worth spending money on more servers to improve performance up to this point. This is often a subjective judgment, and an arbitrary answer of 10%, 5%, or 1% is chosen. Regardless of how the figure is determined, you can use it to define your goal in terms of response times. If you decide that response times should be at most 1 second and that at most 10% of users should experience performance slower than 1 second, then you can invert the percentage to ask the question in a more convenient manner: what was the response time that 90% of users didn’t exceed? This response time should be at or below 1 second to meet your performance requirement. It’s known as the 90th percentile.[7]

7

To calculate a percentile, take all of your data points, sort them in ascending order, and then take the value that’s at index (n × p / 100) – 1, where n is the number of values and p is the percentile. For example, the 90th percentile of {11,22,33,44,55,66,77,88,99,111} is 99 (index is 8 == (10 × 90 / 100) – 1). Intuitively, 90% of values are at or below 99.
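A minimal Python sketch of that calculation, using the footnote's example data (real analytics libraries typically interpolate between neighboring values rather than using this simple index):

```python
def percentile(values, p):
    """Return the p-th percentile using the sort-and-index method described above."""
    ordered = sorted(values)
    index = int(len(ordered) * p / 100) - 1   # (n x p / 100) - 1
    return ordered[max(index, 0)]

data = [11, 22, 33, 44, 55, 66, 77, 88, 99, 111]
print(percentile(data, 90))   # 99 -- 90% of the values are at or below this
```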

Percentiles are useful because they align better with business needs. You want most customers to have a good experience. By charting the percentile, rather than the average, you can directly measure this, and you can do so in a way that’s independent of the distribution of the data. This handles the caching scenario (where you had two user experience clusters) and still provides a useful summary statistic.

Figures 6.4 and 6.5 add the 90th percentile to the previous histograms of response times. Although caching improves the average response time, you can see that the 90th percentile doesn’t improve: 10% of responses are still greater than about 680 ms.

Figure 6.4. Histogram chart (no cache) showing response times, with 90th percentile

Figure 6.5. Histogram chart (with cache) showing response times, with 90th percentile
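To see why the 90th percentile is more trustworthy than the average in this scenario, here's a hedged simulation (the distributions and parameters are invented for illustration; your numbers will differ) of a cache that speeds up roughly half of all requests:

```python
import numpy as np

rng = np.random.default_rng(42)

# Before caching: one skewed cluster of response times (milliseconds).
before = rng.lognormal(mean=6.0, sigma=0.35, size=10_000)

# After caching: ~half the requests become fast cache hits;
# the other half (cache misses) behave exactly as before.
hits = rng.lognormal(mean=4.5, sigma=0.30, size=5_000)
misses = rng.lognormal(mean=6.0, sigma=0.35, size=5_000)
after = np.concatenate([hits, misses])

for label, sample in [("no cache", before), ("with cache", after)]:
    print(f"{label:>10}: mean={sample.mean():6.0f} ms   p90={np.percentile(sample, 90):6.0f} ms")

# Typical result: the mean drops sharply, while the p90 (dominated by the
# unchanged cache misses) improves far less -- the effect in figures 6.4 and 6.5.
```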

Let’s consider a failure scenario in the context of a monolith and see how percentiles can help. Suppose you have a system with tens of servers at most. One of these servers is experiencing problems with a specific API endpoint. In your daily review of the system metrics, the response-times chart for this API endpoint for the server in difficulty looks like figure 6.6.

Figure 6.6. Time series chart of average and 90th percentile response times

This chart shows response time over time. For each time unit, the chart calculates the average and the 90th percentile. To help you understand how the underlying data is distributed, each individual response time is also shown as a gray dot (this isn’t something normally shown by analytics solutions). By comparing historical performance to current performance, you can see that there was a change for the worse. By using percentiles, you avoided missing the problem, because the average doesn’t show it unambiguously. This approach will work when you’re reviewing a small number of servers, but it clearly won’t scale to microservices.

Summary statistics are terrible

A summary statistic creates one number out of many. It’s meant to give you a feel for the larger dataset of numbers by condensing them into a single number. This can be misleading, because information is lost. In particular, the shape of the dataset is lost—but the way the data is distributed can be just as important as the average value.

A famous example is Anscombe’s quartet: four datasets of x and y values that have the same summary statistics, despite being very different. The average of the x values is always 9, and of the y values is always 7.5.

Anscombe’s quartet: the averages of x and y are the same in each dataset.

These datasets also have the same values for more-technical summary statistics such as variance. Anscombe’s quartet shows the importance of visualizing the measurements from your microservice system, rather than just summarizing them.
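You can check this yourself in a few lines; this sketch assumes seaborn's bundled copy of the Anscombe data (downloaded on first use):

```python
import seaborn as sns

# Anscombe's quartet ships with seaborn as a tidy DataFrame:
# columns are 'dataset' (I-IV), 'x', and 'y'.
df = sns.load_dataset("anscombe")

# Identical means for x and y in all four datasets...
print(df.groupby("dataset")[["x", "y"]].mean())

# ...yet the four datasets look completely different when plotted.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
```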

Percentiles are better than averages for understanding system performance, but you shouldn’t be seduced by them, either. They’re still summary statistics, and they hide information. Nor should you be naïve about the cost of calculating them (lots of data points must be sorted). Many analytics solutions give you a percentile estimate, rather than the true value. Read the fine print.

6.1.4. Microservice configurations

Microservices create some difficult problems for the traditional approach to metrics. The network includes an order of magnitude more elements. You’ll have hundreds of microservice instances in a production system, multiple instances of each kind of service, and several different versions that live in production at the same time. You’ll probably use containers on top of virtual machines.

Although you’ll always want to have time series metrics such as CPU load at a fine-grained level, for each microservice, it’s infeasible to review all of them. When something goes wrong—when the response time of an important web-service endpoint degrades—where do you start looking? You don’t want to have to laboriously open each microservice to check its metrics.

Here are the measurement problems you need to solve:

  • Observing the health of the system without needing to observe each component
  • Understanding the structure of the system as it is in reality
  • Diagnosing and isolating faults quickly and easily
  • Predicting where problems will arise

Somehow, you must summarize the state of the system into a set of useful measurements. The problem with time series charts is that you must either plot data for all the services, leading to a noisy chart that’s impossible to decipher, or statistically summarize the data (using means or percentiles) and lose resolution on the aberrations you’re looking for.

The scatterplot is an alternative visualization that’s better suited to microservice architectures. It’s suitable for analyzing the relationships of large numbers of elements—precisely the circumstance you’re in.

6.1.5. The power of scatterplots

In the case of the monolith, you noticed that something was awry by comparing historical and current response times. You did this visually: you saw the line in the chart curve upward. The chart showed performance over time, and it was easy to adjust to show the historical and current response times. You could compare the current problematic response times to earlier times when performance was healthy. This comparison is the key to identifying the performance problem. You can use scatterplots to perform the same comparison over hundreds of microservices, all on the same chart.

A scatterplot is a way to visually compare two quantities. You have a list of things, such as servers, messages, or microservices, and each thing becomes a dot on the chart. Each thing should have two numerical attributes that you want to compare: one for the x-axis and one for the y-axis. A classic scatterplot example compares the weights and heights of a group of people (see figure 6.7): you expect taller people to be heavier, so the dots form a shape that tends upward and to the right. This shows that the two quantities are correlated.[8]

8

This public domain data is sampled from the following report: Anthropometric Reference Data for Children and Adults: United States, 2007–2010, National Center for Health Statistics, Centers for Disease Control and Prevention. Scatterplots are often used to show a correlation between two variables in a scientific study in order to investigate whether one variable causes changes in the other. Correlation by itself can’t do this because it shows only the relationship between the variables. An underlying scientific theory is needed to argue for causality. For our purposes, in the muck and grime of production software systems, we’re mostly concerned with the relationship as a diagnostic tool, rather than demonstrating causality.

Figure 6.7. Scatterplot of weight (in kilograms) versus height (in centimeters)

You need two numerical attributes, and that presents a problem for time series data, because you have only one: the value of the measurement at a given time.[9] How can you define two numbers to describe response times so that you can compare current and historical behavior for each microservice? The answer is in the statement of the question—you use summary statistics for a historical period and for the current period.

9

Just plotting over time gives ... a time series.

Let’s use response times over the last 24 hours as the historical data and response times over the last 10 minutes as the current data.[10] You could summarize the data using the average response time, but as you’ve seen, this isn’t a useful statistic when you’re interested in the experience of the majority of users. Instead, let’s use the 90th percentile. If you have 100 microservices, you can calculate these numbers for each one and then chart the scatterplot, as shown in figure 6.8.

10

Adjust the historical and current time ranges as needed for your own data and system.

Figure 6.8. Current-behavior scatterplot of microservice response times

If everything is working as expected, then current performance should resemble historical performance. The response times should be highly correlated and form a nice, obvious upward line in the chart. In the figure, there’s one outlier, and you can easily identify the errant service.[11] This scatterplot is a great way to visualize current behavior.

11

In this context, “easily” means you’re using a charting library that can interactively identify data points.

The scatterplot is a snapshot taken at a single point in time, and this is a significant difference from a time series chart. It can be useful to generate a series of scatterplots over time and then animate them together, showing the system changing over time.
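Here's a rough sketch of how such a current-behavior scatterplot might be generated; the input layout (per-service lists of response times for the two windows) and the axis labels are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def p90(samples):
    return np.percentile(samples, 90)

# Assumed input: {service_name: (last_24h_response_times_ms, last_10m_response_times_ms)}
def current_behavior_scatterplot(response_times_by_service):
    historical = [p90(day) for day, _ in response_times_by_service.values()]
    current = [p90(recent) for _, recent in response_times_by_service.values()]

    plt.scatter(historical, current)
    # Healthy services sit near the diagonal: current p90 ~= historical p90.
    limit = max(historical + current)
    plt.plot([0, limit], [0, limit], linestyle="--")
    plt.xlabel("90th percentile, last 24 hours (ms)")
    plt.ylabel("90th percentile, last 10 minutes (ms)")
    plt.show()
```

Dots far above the diagonal are the services to investigate first.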

6.1.6. Building a dashboard

It’s strongly recommended that you sign up for a third-party analytics service and use it to collect measurements. A wide range of competent solutions are available on the market.[12] Unfortunately, most of them are focused on the monolithic architecture: they don’t handle large numbers of network elements well, and they focus primarily on time series analysis. They’re still useful, though, because those traditional aspects of your microservice system remain important.

12

Commercial solutions include New Relic (https://newrelic.com), AppDynamics (www.appdynamics.com), Datadog (www.datadoghq.com), and similar.

To fully measure your system, and to make the measurement techniques discussed in this chapter available to your teams, you’ll need to consider building a small custom solution. Over time, more and better support for microservices will appear in the analytics services, reducing the amount of custom work you’ll need to do. As with deployment, choosing microservices means a commitment to custom tooling in the interim, and it’s important to factor this into your decision to use microservices.

Unfortunately, most of the open source analytics and dashboard solutions also focus on traditional use cases. You’ll need to choose a set of open source components that let you build custom data processing and custom charting relatively easily.[13] Don’t be afraid to build “good-enough” dashboards. You can achieve a great deal by loading raw data (suitably summarized) into the browser and generating charts by hand with a charting library.

13

Try InfluxDB (www.influxdata.com), Graphite (https://graphiteapp.org), or Prometheus (https://prometheus.io).

You also don’t need to build an interactive dashboard. An increasing number of data science tools are available, both commercial and open source, that allow you to analyze data and generate reports[14] using scatterplots. This might be a manual process initially, but it can easily be automated if necessary. The reports won’t be real-time and won’t help with that class of faults. Many faults aren’t real-time, however. They’re persistent issues—ongoing problems you’re living with. Data science tools are useful for diagnosing and understanding these types of problems.

14

Data science tools can be effective and are worth taking a little time to learn. A good place to start is Anaconda (http://continuum.io), which is a curated package of Python, R, and Scala tooling.

6.2. Measurements for microservices

You need a structure to define the measurements for a microservice system. You can use the pathway from business requirements to messages to services that we’ve already defined; this gives you three layers of measurement.

6.2.1. The business layer

Business requirements tend to be both qualitative and quantitative. The qualitative ones describe user experiences in terms of workflows, capabilities, and subjective experiences. The quantitative requirements, which are often fewer in number, tend to focus on easily quantified business goals, performance levels, system capacity, and, if we’ve made our point, acceptable error rates.

Your dashboard absolutely must capture and display these quantified business metrics, and capturing them shouldn’t be an afterthought. (Chapter 7 discusses the importance of these metrics to project success.) From a technical perspective, you can simplify capturing these metrics by using the flow of messages in the system. For example, the conversion rate of an e-commerce application is the ratio of confirmation to checkout messages.[15] Not all metrics can be captured in terms of messages, but you should use message analytics to the fullest extent possible.

15

This shouldn’t be your primary authoritative measure of conversions. Message-flow rates should be used as a first-pass confirmation that you’re hitting your targets.

What about the qualitative aspects of the business requirements? Aren’t these impossible to measure?[16] The answer is an emphatic no! Anything can be measured, in the sense that you can always reduce your uncertainty about something, even with indirect measurements. For example, workflows through the system are represented by message flows through microservices. By tracing the causal flow of messages through the system, you can verify that workflows are being followed as designed. This is a measure of the alignment of the system with the business analysis that produced it. Instead of trying to enforce correctness, you accept that there will be deviations in practice, and you find out what they are. Maybe the workflows that get built in practice are better than the original designs. Or maybe they miss a key business driver, and you need to correct them.

16

For a pragmatic perspective on system measurement, see How to Measure Anything by Douglas Hubbard (Wiley, 2014). It’s a highly recommended read.

How do you measure a goal like “The system should be user-friendly”? Here’s one approach: measure goal completion effort.[17] The business requirements define sets of user goals, such as “find a product,” “perform a checkout,” “sign up for a newsletter,” and so on. How much effort does it take the user in terms of time and interactions to achieve these goals? Although these interactions differ from workflows in that they’re undirected and the user can achieve the goal state via many paths, you can still use message flows as a proxy for measuring effort. In this case, rather than a causal trail through the messages, you track the flow by another attribute, such as the user identifier.

17

There are other approaches that you should consider, such as user surveys.
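As a hedged sketch of measuring goal-completion effort, assume you've exported sampled message records (with user identifiers) into a pandas DataFrame; the column names and goal labels here are invented for illustration:

```python
import pandas as pd

# Assumed columns: user (identifier), goal (e.g. 'checkout'), timestamp (seconds), pattern.
messages = pd.DataFrame([
    {"user": "u1", "goal": "checkout", "timestamp": 10.0, "pattern": "cart:add"},
    {"user": "u1", "goal": "checkout", "timestamp": 55.0, "pattern": "checkout:confirm"},
    {"user": "u2", "goal": "checkout", "timestamp": 12.0, "pattern": "cart:add"},
    {"user": "u2", "goal": "checkout", "timestamp": 14.0, "pattern": "search:product"},
    {"user": "u2", "goal": "checkout", "timestamp": 90.0, "pattern": "checkout:confirm"},
])

# Effort per user per goal: how long it took, and how many interactions it needed.
effort = messages.groupby(["user", "goal"])["timestamp"].agg(
    duration_s=lambda t: t.max() - t.min(),
    interactions="count",
)
print(effort)
```

The medians or percentiles of these duration and interaction counts then become your proxy measures of user-friendliness over time.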

The perspective to take with the qualitative requirements is that you can often recode them in terms of message interactions and then measure the message interactions as a proxy for the qualitative requirement. For example, a completed order must generate a confirmation email and an instruction for the warehouse. The counts of the messages representing these business activities should correlate. Once again, you use messages as a representation of many aspects of the system. This is an advantage of the microservice architecture that isn’t immediately apparent and is easier to achieve once you take a message-oriented, rather than service-oriented, perspective.

6.2.2. The message layer

At the message layer, you can use a uniform set of measurements against each message. By reducing most interactions to message flows, you can analyze everything using the same tools. This lets you validate not only the business requirements, but also the correctness and good behavior of the system. Messages have causal relationships with each other by design, and you can verify that the system is adhering to these desired relationships by watching what the messages do in production.

Consider a message from service A to service B. You know that when this message leaves A, it can be synchronous or asynchronous. When it arrives at B, it can be consumed or observed. You can use this model to develop a set of measurements for messages. Let’s start at the top and work our way down.

Service instances and types

A service instance is a specific operating system process. It may be running in a container on a virtual machine, or it may be running as a bare process on the host system of a developer laptop. There may be hundreds of service instances in a microservice system.

A service type is a shorthand expression for a group of service instances. The type can be a common message pattern that the service instances send or receive, or it can be a deployment configuration. Services can be tagged with a name, and the type may be all versions of service instances with the same tag.

It can be useful to define measurements against a set of service types so that you can understand the behavior of groups of services. The definition of the types is determined by your needs and the architecture of your system. The ability of your analytics solution to handle groupings like this is important and should be a consideration in your choice of analytics system.

Universal measurements

There are universal measurements that apply to all messages. The number of messages sent per unit of time is a measure that’s always applicable. How many login messages per second? How many blog posts per minute? How many records loaded per hour? In general, if you capture the message-flow rate as messages per second, for each message pattern in your system, you have a great starting point for analysis.

Message-flow rates measure load on your system directly, and they do so independently of the number of services, which is a useful advantage as a measurement. Charting them as a time series per message allows you to see characteristic behavior over time, such as usage spikes at lunchtime. You can also trigger alarms if the flow rates go outside expected levels, or use the flow rates to trigger scaling by provisioning more servers. Rather than use indirect metrics, such as response times or CPU levels, you can use message-flow rates as a direct measure of load on the system.

The message-flow rate is by far the most important universal measurement, but there are others you’ll want to capture as well. How many messages of each pattern are sent on a historical basis? How many messages have errors, and what’s the error rate? What size are the messages? You can use these metrics to develop a historical perspective on your system, allowing you to verify that changes are having the desired effects and to diagnose long-running issues.

Choosing an analytics solution for microservices

It’s unlikely that you’ll be able to use a single analytics solution for your microservice system. Certainly, you can use commercial services where doing so makes sense. Such things as API endpoint performance, client-side error capture, page-load times, mobile session durations, and so forth remain relevant whatever your underlying architecture.

To capture measurements that tell you about your microservice system, you’ll need to do some custom work. As time goes by, analytics vendors and open source projects will provide better support for architecture with large numbers of elements. Unfortunately, it’s likely that the emphasis will still be on services, rather than messages.

You can use the message abstraction layer to capture the data you need. Integrate distributed tracing at this point.[18] You can also capture message counts and flow rates. It’s unwise to send the raw data to an analytics collection point, because the volumes will be too high. Instead, you’ll have to summarize or sample the data. A summary might be the metric “messages seen per minute,” sent once a minute. A sample might be to capture the details of 1% of messages.

18

An important paper to read on distributed tracing is “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” by Benjamin H. Sigelman et al., Google Research, 2010, https://research.google.com/pubs/pub36356.html. A good open source implementation is Zipkin (http://zipkin.io).

You’ll then store these measures using a time series database and use the data in that database to generate a custom dashboard. You can read this chapter as a high-level description of what you’ll need to build.
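Here's one possible shape for that collection code, as a minimal sketch; the flush interval, sample rate, and the `report` callback (which would forward data to whatever time series store you choose) are all assumptions:

```python
import random
import time
from collections import Counter

class MessageMetrics:
    """Summarize and sample message traffic instead of shipping every event."""

    def __init__(self, report, flush_every=60.0, sample_rate=0.01):
        self.report = report              # callable that forwards data to your analytics store
        self.flush_every = flush_every    # summarize once per minute
        self.sample_rate = sample_rate    # capture full details for ~1% of messages
        self.counts = Counter()
        self.last_flush = time.monotonic()

    def observe(self, pattern, message):
        self.counts[pattern] += 1
        if random.random() < self.sample_rate:
            self.report({"type": "sample", "pattern": pattern, "message": message})
        if time.monotonic() - self.last_flush >= self.flush_every:
            self.flush()

    def flush(self):
        elapsed = time.monotonic() - self.last_flush
        for pattern, count in self.counts.items():
            self.report({
                "type": "summary",
                "pattern": pattern,
                "count": count,
                "per_second": count / elapsed,   # the message-flow rate
            })
        self.counts.clear()
        self.last_flush = time.monotonic()
```

You'd call `observe` from the point in your message abstraction layer where messages are sent or received.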

Measuring synchronous and consumed messages

You can establish well-defined measures for synchronous and consumed messages, some of which you can reuse for other message types. Suppose a message leaves service A at a time local to A, and you want to capture this event using your analytics system. You want to count the number of messages of each pattern that service A emits and calculate the message-flow rates for that service instance and service type and for the entire system. The analytics system should combine all the reports from all services so that you can get the total counts for each message pattern.

Note

The following sections of this chapter, outlining the basic ways to measure messages, should be considered reference material for building your microservice analytics and as a starting point for your own fine-grained, context-specific metrics. Please feel free, as with the other reference sections in this book (such as the list of message patterns in section 3.5), to skim on first reading.

Your analytics solution should allow you to aggregate message events so that you can calculate message counts and timing means and percentiles over periods of time that are of interest to you. Message-flow rates are most useful on a per-second basis, because you want to react quickly to changes. Message counts and timings can use longer time periods, from minutes to hours to days. On this basis, you can define the measures per message instance per pattern, shown in table 6.1.

Table 6.1. Outbound message measures

Case              | Measure | Type     | Aggregation                                                    | Description
outbound-send     | count   | event    | Count over a time period.                                      | Capture the number of messages sent.
outbound-send     | pass    | event    | Count over a time period.                                      | Capture the number of messages successfully sent.
outbound-response | count   | event    | Count over a time period.                                      | Capture the number of message responses received.
outbound-response | pass    | event    | Count over a time period.                                      | Capture the number of successful message responses.
outbound-response | time    | duration | Time taken as mean or percentile over message-response times.  | Capture the response time.

You use outbound-send/count to see how many messages are being sent. Each message is a data point, and the count is a count of the data points. It’s useful to know whether messages are being successfully sent, independent of successful receipt or processing—this is captured by outbound-send/pass. This gives you the perspective of the sending service with respect to network health.[19]

19

One useful derivative is outbound-send/fail, calculated by subtracting passes from the total count. A useful extension is outbound-send/wtf, where the nature of the error response is wholly unexpected. This can indicate potentially catastrophic edge cases in production that you need to handle more deliberately.

Here’s an example scenario. You’re using intelligent load balancing to route messages based on their pattern. One of the load balancers is intermittently defective, so sending services sometimes can’t contact it. In this case, outbound-send/pass will be less than 100% of outbound-send/count. A useful chart, therefore, is the ratio of these two measures. Because the aggregation is a count function, it’s independent of the number of services and the subdivision of patterns, so you don’t lose information by aggregating all the behavior in the system into more-general charts.
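A minimal sketch of that ratio chart, assuming the per-minute outbound-send counts and passes have already been collected into a pandas DataFrame (the column names and numbers are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: minute (timestamp), sends (outbound-send/count), passes (outbound-send/pass).
sends = pd.DataFrame({
    "minute": pd.date_range("2017-01-01 09:00", periods=5, freq="min"),
    "sends":  [120, 118, 131, 125, 122],
    "passes": [120, 118, 110, 101, 122],   # a dip: the flaky load balancer
})

sends["pass_ratio"] = sends["passes"] / sends["sends"]
sends.plot(x="minute", y="pass_ratio", ylim=(0, 1.05),
           title="outbound-send pass/count ratio (should stay near 1.0)")
plt.show()
```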

The outbound-response/count measure captures the number of message responses received. If everything is working, then outbound-send/count and outbound-response/count should be closely correlated, with some jitter due to network traversal time. You can generate a scatterplot to verify this; see figure 6.9.

Figure 6.9. Synchronous message outbound send and response counts

In the scatterplot, you can see three message patterns that are behaving oddly. Each message pattern is plotted using a different marker shape so you can tell them apart.[20] The upward-arrow marker shape indicates that the associated message pattern is receiving no responses. Clearly, something is broken. The left-arrow marker shape shows response counts well below send counts, indicating a situation that isn’t healthy. Finally, the downward-arrow marker shape shows responses far above sends. This happens when you add capacity to clear a backlog. This scatterplot is sensitive to the time period it represents, so you should plot it over several orders of magnitude.[21]

20

Shapes are used here because in a book, the chart isn’t interactive!

21

If messages are backlogged in a queue, then even though the send and response counts might be correlated, the responses are for older messages, and latency is probably unacceptably high, even if throughput is OK.

The responses to synchronous messages can be error responses, meaning the receiving side failed to process the message and is sending back an error response. You can track the success rate of responses with outbound-response/pass and derive the error rate from that. Bear in mind that you need to calibrate the time period. It can also be useful to classify the type of error response. Was it a timeout, a system error on the receiving side, a badly formed message, or something else? You can then use these classifications to drill down into the failure events. Charting the error rates over time is a good way to get a quick view of system health; see figure 6.10. Error counts should be low, so you can take the shortcut of charting them all together, because problematic services will stand out.

Figure 6.10. Outbound response failure counts over time

The outbound-response/time measure captures the response time of the message. This is the total response time, including network traversal and processing time on the receiver. You can use this measure to identify messages and services that have degraded performance. This measure must be aggregated over a set of messages. As discussed earlier, the most desirable aggregation is the percentile. Sadly, this can be expensive to compute, because the data points have to be sorted, which is prohibitive for large datasets. Some analytics tools let you estimate the percentile by sampling, and this is still a better measure than the mean. The mean is easier and much faster to calculate, so you may decide to use it for interpretation with appropriate care.
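If your tooling doesn't provide percentile estimates, the sampling shortcut mentioned above can be sketched in a few lines (the sample size is an arbitrary assumption; larger samples give tighter estimates):

```python
import random

def estimated_percentile(values, p, sample_size=1_000):
    """Estimate the p-th percentile from a random sample instead of sorting everything."""
    sample = values if len(values) <= sample_size else random.sample(values, sample_size)
    ordered = sorted(sample)
    index = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[index]
```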

The outbound-response/time measure is useful for generating current-behavior scatterplots. You can see whether the current response times are healthy compared to historical behavior. The scatterplot can show response times for each message, or you can aggregate over message types using pattern matching. The scatterplot in figure 6.11 collects all message types into 20 major families (taking the average within each family). The downward arrow indicates a family of messages that might be of concern.

Figure 6.11. Current response-time behavior by message family

The receiving side of the synchronous-consumed interaction also provides some measures that are important, as listed in table 6.2.

Table 6.2. Inbound message measures

Case             | Measure | Type     | Aggregation                                                                            | Description
inbound-receive  | count   | event    | Count over a time period.                                                              | Capture the number of messages received.
inbound-receive  | valid   | event    | Count over a time period.                                                              | Capture the number of received messages that are valid and well-formed.
inbound-receive  | pass    | event    | Count over a time period.                                                              | Capture the number of received messages that are successfully processed.
inbound-response | count   | event    | Count over a time period.                                                              | Capture the number of responses sent.
inbound-response | pass    | event    | Count over a time period.                                                              | Capture the number of successful responses.
inbound-response | time    | duration | Time taken to process the message, as mean or percentile over message-response times. | Capture the processing time.

The number of inbound messages is captured by inbound-receive/count. You can compare this with outbound-send/count to verify that messages are getting through. Again, this is independent of the number of services or the rate of messages. The number of responses is captured with inbound-response/count; you can compare this with outbound-response/count. The success or failure status of the message is captured with inbound-response/pass. Again, you should classify failures: Was the sender unreachable? Was there a processing error? Did a downstream service that you depend on fail? Use inbound-receive/valid to separately count the received messages that are valid and well-formed; this is useful for detecting issues with the services that send messages to you.

The processing time for messages is captured using the inbound-response/time measure. As with outbound-response/time, this needs to be aggregated over time using a mean or percentile, and the same issues with interpretation and estimation apply. You can also use scatterplots for this measure to view message health.

There’s a useful scatterplot you can generate if you compare outbound-response/time and inbound-response/time; see figure 6.12. This shows you how much network latency affects your messages, and which messages may be slower due to size or routing issues.

Figure 6.12. Different message transports have different delivery speeds.

In this scatterplot, you can see that most messages incur an additional 200 ms or so of network traversal. This includes not only transmission time, but also routing and parsing time through the message abstraction layer. A group of messages is faster than this, incurring only 100 ms. This is probably a different, faster, message transport. Each message transport will have its own performance characteristics that become evident when shown this way. Such a scatterplot can guide your intuition about what system behavior “should” look like.

Measuring synchronous and observed messages

This message interaction describes the scenario where you have additional services observing a primary synchronous interaction. For example, you might have an auditing service, you might be capturing event flows for review, or you might have secondary business logic. From the perspective of the sender, the additional services are invisible, so there are no measurements to capture. From the perspective of the additional observers, you can capture the measurements listed in table 6.3.

Table 6.3. Inbound message measures

Case            | Measure | Type  | Aggregation               | Description
inbound-receive | count   | event | Count over a time period. | Capture the number of messages received.
inbound-receive | pass    | event | Count over a time period. | Capture the number of received messages that are successfully processed.

These are used in the same manner as inbound messages to the primary receiver. You can compare these measures with those of the primary sender and the primary receiver to verify that all messages are being delivered and processed properly. Be careful to compare only on a per-observer-type basis, because each observer will receive a copy of the message, and thus the measures aren’t independent of the number of observer types.

Measuring asynchronous and observed messages

This is the classic fire-and-forget pattern. You emit events into the world as messages, and you don’t care who receives them or what they do. The measurements are thus all on the sending side; see table 6.4.

Table 6.4. Outbound message measures

Case          | Measure | Type  | Aggregation               | Description
outbound-send | count   | event | Count over a time period. | Capture the number of messages sent.
outbound-send | pass    | event | Count over a time period. | Capture the number of messages successfully sent.

These metrics mirror those for the synchronous case. There’s no set of measures for the responses, because there are no responses. Be careful to count outbound-send/pass only for each message, not for each receiver. This retains the utility of the measure as something you can use for comparisons, independent of the number of services.

On the inbound side, you can capture measures on a per-service-type basis; see table 6.5.

Table 6.5. Inbound message measures

Case            | Measure | Type  | Aggregation               | Description
inbound-receive | count   | event | Count over a time period. | Capture the number of messages received.
inbound-receive | pass    | event | Count over a time period. | Capture the number of received messages that are successfully processed.

As with observed synchronous messages, observed asynchronous messages need to be aggregated on a per-observer-type basis for comparison purposes.

Measuring asynchronous and consumed messages

This interaction is almost the same as the asynchronous/observed interaction, except that only one service consumes the message. Thus, you can compare the measures directly without worrying about multiple receivers.

6.2.3. The service layer

In practice, you’ll capture the message measurements directly within each service, using the data-collection framework of your analytics system. That means you can also analyze these metrics down to the service, giving you a perspective on individual service instances.

This isn’t as useful as it sounds. Individual service instances are ephemeral in a microservice architecture, and you monitor them precisely because you want to quickly eradicate problematic service instances and replace them with healthier ones. So, it’s more useful to think in terms of families of service types and perform your analysis at that level. In particular, this allows you to identify problems introduced by new versions of services, ideally during the rollout of a Progressive Canary (discussed in chapter 5) so that you can roll back to a known-good version if it proves problematic.

You should still collect all service-instance telemetry. This is important supplementary data for debugging and understanding production issues. Fortunately, most analytics systems are focused on collecting this data, so you won’t need to do much, if any, custom work, even when you’re running thousands of microservices. Your message metadata should include service-instance identifiers so that you can match up the data later.

Where’s the problem?

If a new version of a service is causing problems, you want to know about it. Again, by “problems” I mean poorer-than-expected behavior in terms of performance or correctness, shorter lifetimes, and other issues that aren’t immediately fatal. These nonfatal deviations are an inevitable consequence of building software quickly; but if you’re unable to resolve them in a timely fashion, they’ll remain in the system to accumulate. They’re a form of technical debt.

Let’s take a service perspective, rather than a message perspective, for a change. Each service type has a set of messages that arrive from upstream services and a set of messages that are sent to downstream services. When there’s a problem, it’s natural to ask whether the location of the problem is upstream from, local to, or downstream from the service under investigation.

Upstream problems

The inbound-receive/count will change if upstream services are in trouble. They may be sending too few messages or sending too many. You can generate a current-behavior scatterplot for service types over the inbound-receive/count to observe these changes. You should also check inbound-receive/valid to make sure you aren’t receiving invalid messages.

Local problems

The problem could be local to your service. If your service is unable to handle the load, you’ll see aberrations in the inbound-receive/count and inbound-response/count relationship, which you can detect via a scatterplot.

Automated scaling should prevent this from occurring, so you have work to do to figure out why the scaling isn’t doing its job. Sometimes the load grows too quickly.

If the inbound-receive/count and inbound-response/count relationship looks OK, then you need to check for elevated levels of message-processing errors using inbound-response/pass. You can use a current-behavior scatterplot to find the errant service types and drill down using the time series for those services.

Perhaps your service is just slow. In that case, you can use a current-behavior scatterplot over inbound-response/time to find aberrations, and time series charts to examine individual service types.

Finally, your responses may not be getting through. If you’re using separate transport channels for requests and responses,[22] this is entirely possible. Use inbound-response/pass to catch this.

22

Some transport configurations use a message bus for outbound messages and then use direct HTTP for the response. This is to avoid churn on the response queues.

Downstream problems

Your service may be returning errors or running slowly only because something it depends on is behaving badly. If you rely on downstream services, you need to validate the health of those services from your perspective. They may be the cause of the errors or slowness.

Use the outbound-send family of measures to diagnose downstream problems. If you can’t reach downstream services, then the correlation between outbound-send/count and outbound-send/pass will degrade, and you can catch this using a current-behavior scatterplot.

If you can reach downstream services but they’re having difficulties, then outbound-response/pass will deviate. Perhaps a new data field on a message has triggered an unforeseen bug, and the downstream services can only return error responses.

If downstream services are slow, then the outbound-response/time measure will deviate. You can use the usual scatterplot followed by time series analysis.

Service-instance health

Sometimes it’s necessary to review the health of an individual service instance. This review is at the level of service-instance processes or, more typically, containers. You should anticipate having hundreds of service instances, so the current-behavior scatterplot is a good place to start. But sometimes it isn’t sufficient, because you need to review behavior over a period of time for many services. You could plot them all on a time series chart, but as you’ve seen, this is too noisy for large numbers of services.

One solution is to use a categorical scatterplot.[23] This shows all events in a given time period for each member of a category. In this case, this would be per service instance. The advantage of such a plot is that you can compare services over time in a clear way, and it works well up to tens of services. In practice, this limit isn’t a problem, because you’ll generally want to focus on the instances within service families.

23

There’s no second numerical attribute, because categories aren’t numbers, so the standard scatterplot won’t work.

Let’s construct a scenario. Suppose you have 20 instances of a service running. This service uses local storage as part of its business logic. Each instance runs on a container in a virtual machine, and the orchestration and deployment system allocates the service containers automatically. First, you chart the current behavior with respect to response time, but it isn’t clear where the problem lies, because the services are behaving relatively consistently (see figure 6.13). If any service is slow, it’s consistently slow.

Figure 6.13. Current behavior of service instances of the same type

Now, let’s use the categorical scatterplot. For each service, plot the response times for all messages over a given time period (see figure 6.14). Each dot represents a message, and the vertical axis is the response time. The horizontal axis lists the service instances; it’s a category, not a number. To make the data easier to see, you add some horizontal jitter to the dots so they don’t obscure each other as much.

Figure 6.14. Categorical scatterplot of service-instance response times

In this chart, you can see that service s03 has poor performance. Further investigation shows that the container for this service has been allocated to a virtual machine that’s running out of memory and swapping to disk. Now you know where to focus your efforts.
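A categorical scatterplot like figure 6.14 takes only a few lines with seaborn; this sketch assumes a DataFrame of sampled messages with invented column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed columns: 'service' (instance name, e.g. 's01'..'s20') and 'response_ms'.
def service_instance_stripplot(messages):
    # One dot per message; jitter spreads the dots horizontally so they overlap less.
    sns.stripplot(data=messages, x="service", y="response_ms", jitter=True, size=3)
    plt.xlabel("service instance")
    plt.ylabel("response time (ms)")
    plt.show()
```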

The categorical scatterplot is useful not only for service instances, but also for any categories within the system, such as service types, versions, or message patterns. It provides another way to understand your system.

6.3. The power of invariants

Message-flow rates are a useful tool for determining the health of the system, but they have issues that make them difficult to use directly. Consider user logins. Each user login generates a message. You can track these messages and calculate the flow rate. The problem is that the flow rate fluctuates over time: there are more logins during the day than at night.

You can use the raw flow rate for scaling. If lots of people want to log in, then you’ll need more user login services. You can determine appropriate levels for the number of services for a given flow rate, trigger new instance deployments when the flow rate goes up, and trigger instance retirements when the flow rate goes down.

The flow rate will also identify extreme conditions. If you deploy a new login service that’s fundamentally broken, the user login message-flow rate will decline dramatically. Similarly, if a misbehaving service is stuck in a loop and generating lots of user login messages, you’ll catch that as well.

You’d like to catch logic bugs that affect only a subset of data; in that case, the flow rate drops only a little, and it’s hard to tell why from the flow rate alone. You’d also like to verify deployments of new versions of the user login service while limiting the impact of any new issues.

You need a measurement that’s independent of flow rate. The load on the system doesn’t matter; you’re interested in the correctness of the system. To achieve these goals, you can use the ratios of causally related message-flow rates.

In the user login example, a successful login loads the user’s profile to build their welcome page. A successful user-login message always causes a load-user-profile message. In any given time segment that’s not too small, the number of user-login messages should be about the same as the number of load-user-profile messages. In other words, the ratio of the flow rates should be about 1.

A ratio is a good measure because it’s dimensionless. Flow-rate ratios have the property that they’re the same no matter what the load is on the system or how many services are sending and receiving messages. All that matters is the relative rates at which the causally related messages flow through the system.

Calculating message-flow rates and ratios

How do you calculate the message-flow rate? Your message transport, if it’s a message bus, may be able to do this for you. If not, you’ll need to do it yourself within the message abstraction layer, reporting the numbers back to your analytics solution.

A simple approach is to count the number of messages of each type seen in a given time window. You don’t store these numbers; you pass them on to be analyzed.

Calculating the ratio of message-flow rates is more difficult. Some time series databases allow you to do this directly. If not, then you’ll have to do it yourself within the message-abstraction layer. This isn’t ideal, because you’ll need to decide in advance which ratios to calculate, and you’ll also need to collect ratios from each service and average them.

None of this is rocket science, but you must factor it into your planning when you decide to use microservices. The development, deployment, and measurement infrastructure is necessarily more complex than it is for monoliths.

Ratios of message-flow rates are system invariants. That means that for a given configuration of message patterns, they remain the same despite changes to load, services, and network. You can use this fact to verify the correctness of the system. If you know there’s a causal relationship between two messages, in that the arrival of one causes the other to be sent, then you can capture the ratio of their flow rates over time. Deviations from the expected ratio value represent incorrect behavior. Figure 6.15 shows an example chart for the user-login versus load-user-profile case; as you can see, there’s been a worrying change in the ratio.

Figure 6.15. Ratio of message-flow rates

You can use this ratio chart as a control chart. If you deploy a new version of the user-login service using the Progressive Canary deployment pattern from chapter 5, you can use the flow-rate ratio to validate that the new version hasn’t broken anything. If the new version of the service is broken, you’ll see a change in the ratio that’s far more significant than a mere dip in the user login rate. The ratio should never change—that’s the nature of an invariant.
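A rough sketch of how such a control check might look, assuming you can query the current observed ratio; the 5% tolerance is an arbitrary illustration, and in practice you’d derive it from the ratio’s historical variation.

```python
def check_invariant(observed_ratio, expected_ratio, tolerance=0.05):
    """Return True if the observed flow-rate ratio is within tolerance of
    the expected invariant value; False signals a breached invariant."""
    deviation = abs(observed_ratio - expected_ratio) / expected_ratio
    return deviation <= tolerance

# During a Progressive Canary rollout you might poll the live ratio and
# halt the rollout when the invariant breaks.
if not check_invariant(observed_ratio=0.82, expected_ratio=1.0):
    print("Invariant breached: pause the rollout and investigate")
```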

6.3.1. Finding invariants from the business logic

Each business requirement that’s encoded by a message flow establishes an invariant. Consider the microblogging system. Posting a new entry generates a synchronous message to the entry store and an asynchronous announcement message about the entry (see figure 6.16). Thus the ratio of the combined flow rate of the two messages the post:entry message causes (info:entry and store:save,kind:entry) to the flow rate of post:entry itself is 2.

Figure 6.16. Messages caused by posting an entry

Starting from the basic approach to microservice system design—the decomposition of business rules into message flows—you can build this set of invariants to validate the system under continuous change. This allows you to make changes quickly and safely without breaking things. When one or more invariants deviate, you can roll back the last change.

Invariants also let you check the health of the live system. Between deployment changes, many other things can go wrong: bugs can be triggered, load thresholds can be reached, memory leaks can push systems over the edge, poison messages can cause progressive degradation, and so on. You can use invariants to monitor the system for any failure to operate as designed. A problem, by definition, impacts the correct functionality of the system and breaches the invariants.

You should consider automating the configuration of invariants derived from business rules. Because the business rules are expressed directly as message flows, you can do this for every business rule.
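One possible way to automate this, sketched below, is to keep the expected invariants as plain data derived from the message-flow definitions and check each window’s observed counts against them. The message patterns, the INVARIANTS structure, and the tolerance are assumptions for the example.

```python
# Each invariant names a causing message pattern, the patterns it causes,
# and the expected ratio of caused messages to causing messages.
INVARIANTS = [
    {"causing": "post:entry",
     "caused": ["info:entry", "store:save,kind:entry"],
     "expected_ratio": 2.0},      # each post causes two downstream messages
    {"causing": "user:login",
     "caused": ["user:profile:load"],
     "expected_ratio": 1.0},
]

def check_business_invariants(counts, tolerance=0.05):
    """counts maps message pattern -> number of messages seen in the window."""
    breached = []
    for inv in INVARIANTS:
        causing = counts.get(inv["causing"], 0)
        if causing == 0:
            continue   # no traffic for this flow in this window; nothing to check
        caused = sum(counts.get(p, 0) for p in inv["caused"])
        ratio = caused / causing
        if abs(ratio - inv["expected_ratio"]) / inv["expected_ratio"] > tolerance:
            breached.append((inv["causing"], ratio))
    return breached

counts = {"post:entry": 100, "info:entry": 98, "store:save,kind:entry": 101,
          "user:login": 500, "user:profile:load": 430}
print(check_business_invariants(counts))   # flags the user:login invariant
```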

6.3.2. Finding invariants from the system architecture

You can also derive invariants from the topology of the system architecture by working at the level of service instances. A synchronous/consumed message represents a classic request/response architecture. But you don’t implement this with just one service doing the requesting and just one service doing the responding. Typically, you scale with multiple requesters and responders, distributing the messages over the responders using an appropriate algorithm, such as round-robin.

Actor-style invariants

There’s an invariant here: the requester’s outbound-send/count represents the total number of messages sent, and every responder should see some fraction of that total, depending on how many responders there are. Given one requester and four responders, as shown in figure 6.17, each responder should see one quarter of the total messages as inbound messages. In terms of the measurements defined earlier, inbound-receive/count for any given responder should be one quarter of outbound-send/count.

Figure 6.17. Actor/service interaction
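As a sketch of how the actor-style check could be automated, the function below compares each responder’s inbound-receive/count against its expected share of the requester’s outbound-send/count; the function name, tolerance, and example numbers are invented.

```python
def actor_share_invariant(outbound_send_count, inbound_receive_counts, tolerance=0.1):
    """Actor-style check: with N responders behind a consumed message,
    each responder's inbound-receive/count should be roughly
    outbound-send/count divided by N. Returns the counts that deviate."""
    n = len(inbound_receive_counts)
    expected = outbound_send_count / n
    return [count for count in inbound_receive_counts
            if abs(count - expected) / expected > tolerance]

# One requester, four responders: each should see about a quarter of 1,000 sends.
print(actor_share_invariant(1000, [252, 248, 251, 130]))   # flags the lagging instance
```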

Publish/subscribe-style invariants

The asynchronous/observed pattern has a similar analysis, except that each observer should see all the outbound messages (see figure 6.18). Thus, the ratio of outbound-send/count to inbound-receive/count per service is 1.

Figure 6.18. Publish/subscribe service interaction

Chain-style invariants

You can also construct invariants based on chains of causal messages. A single triggering message causes a chain of other messages over downstream services (see figure 6.19). Each downstream service represents a link in the chain, and the number of links is the invariant. This invariant is a useful way to capture problems with an entire causal chain that wouldn’t be obvious from observing individual pairwise service interactions.

Figure 6.19. Service/chain interaction

Tree-style invariants

More often than not, the chain is part of a tree of interactions, where a triggering message causes a number of resulting message chains that themselves may cause subsequent chains (see figure 6.20). You can form invariants using each individual chain, and you’ll probably want to do this. But doing so doesn’t tell you directly whether the entire tree completed successfully. To do that, you can form an invariant using the tree’s leaf services. Eventually, all message chains must end, and the number of leaves (the final link in each chain) must always be the same.

Figure 6.20. Service/tree interaction
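To make the chain- and tree-style checks concrete, here’s a rough sketch that assumes your tracing layer can report how many leaf messages each traced tree produced; the function and data shapes are invented for the example.

```python
def check_tree_invariant(leaf_counts, expected_leaves):
    """Chain/tree-style check: every completed tree rooted at the same
    triggering message pattern should end in the same number of leaves.

    leaf_counts is a list of observed leaf counts, one per traced tree;
    expected_leaves is the number the message-flow design predicts.
    Returns (number of breaching trees, number of trees checked).
    """
    bad = [count for count in leaf_counts if count != expected_leaves]
    return len(bad), len(leaf_counts)

# If posting an entry should always fan out to 3 leaf messages, any other
# count means a branch of the tree failed to complete.
failures, total = check_tree_invariant([3, 3, 2, 3, 3], expected_leaves=3)
print(f"{failures} of {total} traced trees breached the invariant")
```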

How do you choose a system invariant to use? This decision isn’t as simple as the case with business rules, because you have far more possibilities to choose from. You’ll need to examine the system and make an educated guess as to the most useful invariants, building them using the patterns we’ve discussed. Over time, your choices will improve as you gain familiarity with the system’s operational characteristics.

6.3.3. Visualizing invariants

Invariants are time series–based data. You calculate the ratio at a point in time (representing some previous time window). You’ll have lots of invariants, so you’ll again run into the problem of reviewing large numbers of time series charts. As before, you can use a current-behavior scatterplot to solve this problem.

By charting all the invariants in a scatterplot, you can get a full overview of the correctness of the entire system. This is particularly useful for highlighting any unintended consequences of the changes you make. You can use this approach to verify changes to services as well as changes to the network topology.[24]

24

An embellishment to consider is an animated scatterplot that shows the evolution of the system over time by animating the snapshots. It takes a little more work to build this as a custom chart, but it’s satisfying to observe in practice.

6.3.4. System discovery

If the microservice architecture is doing its job and letting you build functionality quickly, you’ll soon have hundreds of services and messages. Even though you designed most of the service interactions, other interactions in the system arise organically as teams respond quickly to business needs. You may not be sure exactly what’s going on anymore.

It’s important to call out this situation as one of the dangers of the microservice architecture. As complexity moves out of code and into the service interactions, you lose full understanding of the message flows, especially as you deploy ever-more-specialized services to handle new business cases. This is a natural result of the architecture and something to expect. Complexity can’t be destroyed, only moved around.

This effect is why it’s so important to have a message abstraction layer: it allows you to follow the principles of transport independence and pattern matching for messages. Doing so preserves the description of the messages in homogeneous terms, rather than making messages specific to services. Microservice systems that are built as lots of services interacting directly with each other quickly become a mess exactly because messages are specific to services. They’re effectively distributed monoliths that are just as difficult to understand, because there’s no easy way to observe all the message flows in a unified way.

Distributed tracing

A shared messaging layer lets you introduce a common distributed tracing system. Such a system works by tracking message senders and receivers using correlation identifiers. The messaging layer can again help by generating these identifiers and attaching the necessary metadata to messages.
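As a rough sketch of what the messaging layer might do, the wrapper below attaches a correlation identifier and a parent reference to every outbound message. The field names and message shapes are assumptions, not the API of any particular library.

```python
import uuid

def with_trace_metadata(message, parent_meta=None):
    """Attach tracing metadata to an outbound message (a plain dict).

    The correlation ID is shared by every message in the same causal flow;
    the span ID identifies this particular hop, and parent_span_id points
    at the hop that caused it.
    """
    meta = {
        "correlation_id": parent_meta["correlation_id"] if parent_meta
                          else str(uuid.uuid4()),
        "span_id": str(uuid.uuid4()),
        "parent_span_id": parent_meta["span_id"] if parent_meta else None,
    }
    return {**message, "trace": meta}

# The first message in a flow starts a new correlation ID...
login = with_trace_metadata({"pattern": "user:login", "nick": "alice"})
# ...and the message it causes inherits that ID, with the login as its parent.
profile = with_trace_metadata({"pattern": "user:profile:load", "nick": "alice"},
                              parent_meta=login["trace"])
print(login["trace"]["correlation_id"] == profile["trace"]["correlation_id"])  # True
```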

The distributed tracing system resolves the flow of messages over the system: it works out the causal structure of message flows by observing messages in the wild. In the microblogging example, a post:entry message triggers a store:save,kind:entry message and a post:info message. The post:info message triggers a timeline:insert message. If you trace this over time, you can build a progress diagram like the one shown in figure 6.21.

Figure 6.21. Distributed trace of an entry post over microservices

The trace chart shows the chain of message interactions; read downward from the first message. Each line represents a new triggered message in the chain. The timings show how long it took to get a response (or for an observed message to be delivered). Using the progress chart, you can analyze the actual message flows in the system, see the services they impact, and understand the message-processing times.

Capturing the data to build such diagrams from a live system is expensive. You don’t want to trace every flow, because doing so puts too much load on the system. But you can build up an accurate view by sampling: trace a small percentage of all messages. Sampling adds negligible overhead and still gives you a statistically representative picture of the system’s behavior.
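The sampling decision itself can be trivial, as in the sketch below; the 1% rate is arbitrary, and the decision is made once at the root of a flow and inherited by the messages it causes, so traces are never partial.

```python
import random

SAMPLE_RATE = 0.01   # trace roughly 1% of message flows

def should_trace(parent_meta=None):
    """Decide whether to trace this flow. Child messages inherit the
    decision from their parent so a trace is captured whole or not at all."""
    if parent_meta is not None:
        return parent_meta.get("sampled", False)
    return random.random() < SAMPLE_RATE

# At the root of a flow: record the decision in the trace metadata so
# downstream hops can inherit it.
root_meta = {"sampled": should_trace()}
print(should_trace(parent_meta=root_meta) == root_meta["sampled"])  # True
```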

Finally, message traces allow you to reverse-engineer message interactions. You can build a diagram of the relationships between the services based on the observed messages. This diagram then gives you an understanding of the system as it is. You can compare it against your desired design to uncover deviations and determine where complexity has arisen. Often, this is business complexity that’s unavoidable, but at least you know that it exists and where it lives.

Distributed logging

You should capture the microservice logs for each service instance. To do this, you’ll need to invest in a distributed logging solution. It isn’t feasible to review the logs of individual service instances directly as a general practice—there’s too much data. This is another consequence of the microservice architecture and another example of the fact that you’ll need more infrastructure to implement it.

The easiest approach to distributed logging, if your business and confidentiality constraints allow it, is to use an online logging service. Such services provide agents for log collection, log storage, and, most important, user-friendly search capabilities. If you’re unable to use an online service, open source solutions are available.[25] Whichever you choose, prioritize the quality of the search interface.

25

Elasticsearch (www.elastic.co) is a good option here.

In a monolithic system, you can get away with grepping[26] log files on the server to debug issues—almost. With a microservice system—and I’m speaking from personal experience—it’s almost impossible to trace message-correlation identifiers over numerous log files to debug a production issue. That’s why you need strong search capabilities.

26

To grep is to manually search text files on the command line, using the grep command.

Manually reviewing logs should happen toward the end of your investigations. Once you’ve narrowed the issue to specific patterns, services, instances, or data entities, you can search the logs effectively.

Error capture

Closely related to log capture is error capture. You should capture all errors in the system on both the server and client sides. Again, commercial offerings are available to do this, but you can also find open source solutions.

Although a distributed logging solution can also capture errors and will log them, it’s better to move error analysis to a separate system. Errors can occur in a far wider range of domains, such as mobile apps, and attempting to force them into the logging model will prevent effective analysis. It’s also important to use the errors for alerting and deployment validation; you don’t want them to be constrained by the processing volume of ordinary logs.

6.3.5. Synthetic validation

You don’t need to wait for problems to find you—you can find them. As part of your monitoring, you can use synthetic messages to validate the system. A synthetic message is a test message that doesn’t affect real business processes. It’s a fake order or user, executing fake actions.

You can use synthetic messages to measure the correctness of the system on a continuous basis. They aren’t limited to API requests, as with a monolith. You can generate any message and inject it into the system. This is another advantage of the homogeneous message layer.
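A synthetic message can simply be an ordinary message that carries a marker the downstream services and analytics recognize, so it exercises the real flow without touching real data. A minimal sketch, with an invented synthetic flag and message fields:

```python
import uuid

def synthetic_entry_post(text="synthetic check"):
    """Build a fake post:entry message, flagged so that downstream
    services can skip real side effects (or write to a scratch store)
    while still emitting the usual caused messages."""
    return {
        "pattern": "post:entry",
        "user": f"synthetic-{uuid.uuid4().hex[:8]}",   # fake user, never a real account
        "text": text,
        "synthetic": True,
    }

# Inject the message through the normal message layer, then verify that the
# expected caused messages (info:entry and store:save,kind:entry) appear.
msg = synthetic_entry_post()
print(msg["synthetic"], msg["pattern"])
```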

There’s no need to limit synthetic test messages to development and staging systems; you can also run them in production to provide a strong measure of the system under the continuous deployment of new services. This mechanism is good at capturing unintended consequences: you can break services far from the one you’re working on, and you’ll find that the chain of causality between messages can be wide in terms of both space and time.

6.4. Summary

  • Measuring microservice systems starts with appreciating their scale in terms of the number of elements. Traditional monitoring that focuses on individual elements won’t suffice. Nor can you understand the microservice system by using standard summary statistics. You need to use alternative measurements and visualizations to understand what’s going on.
  • There are three layers of measurements: the business requirements, the messages that encode them, and the services that send and receive the messages. These layers can be used to structure and organize the monitoring of the system.
  • To comprehend a system with large numbers of moving parts, you can use scatterplots as a visualization technique. These allow you to compare the behavior of a large number of appropriately grouped elements. The current-behavior scatterplot lets you compare behavior over time by plotting against historical norms.
  • In the business layer, you can use message flows to validate the correctness of defined workflows and calculate key performance indicators.
  • The measures to use in the message layer can be derived from the categorization of messages into synchronous/asynchronous and observed/consumed. For each category, you can count the number of occurrences of important events, such as outbound and inbound messages, and timings of message processing and network traversal.
  • In the service layer, you can map the network structure to expected relationships between message counts and timing at both the service type and service instance level. This allows you to verify that your architecture and message-flow design are operating as designed. It also lets you identify problematic network elements.
  • You can establish invariants—dimensionless numbers—that should be effectively constant if the system is operating correctly and is healthy. Invariants can be derived from the ratios of message-flow rates, both at the message-pattern level and at the service level.
  • Correlation identifiers are a vital tool and serve as input to tracing and logging systems. The tracing system can build a live map of the architecture using the traces, and the logging system, by allowing you to search by correlation identifier, provides an effective mechanism for debugging the live system.