You’ve now set up an infrastructure to run your services and have deployed multiple components that you can combine to provide functionality to your users. In this chapter and the next, we’ll consider how you can make sure you always know how those components are interacting and how the infrastructure is behaving. It’s fundamental to know as early as possible when something isn’t behaving as expected. In this chapter, we’ll focus on building a monitoring system so you can collect relevant metrics, observe system behavior, and set up relevant alerts that allow you to keep your systems running smoothly by acting preemptively. When you can’t be preemptive, you’ll at least be able to quickly pinpoint the areas that need your attention so you can address any issues. It’s also worth mentioning that you should instrument as much as possible: data you don’t use today may turn out to be useful someday.
A robust monitoring stack will allow you to gather metrics from your services and infrastructure and use them to gain insights into the operation of the system. It should provide a way to collect, store, display, and analyze data.
You should start by emitting metrics from your services, even if you have no monitoring infrastructure in place. If you have those metrics stored, at any time you’ll be able to access, display, and interpret them. Observability is a continuous effort, and monitoring is a key element in that effort. Monitoring allows you to know whether a system is working, whereas observability lets you ask why it’s not working.
In this chapter, we’ll be focusing on monitoring, metrics, and alerts. We’ll explain logs and traces in chapter 12, and they’ll constitute the observability component.
Monitoring doesn’t only allow you to anticipate or react to issues; you can also use the metrics it collects to predict system behavior or to provide data for business analytics.
Multiple open source and commercial options are available for setting up a monitoring solution. Depending on the team size and resources available, you may find that a commercial solution may be easier or more convenient to use. Nonetheless, in this chapter you’ll be using open source tools to build your own monitoring system. Your stack will be made up of a metrics collector and a display and alerting component. Logs and traces are also essential to achieve system observability. Figure 11.1 gives an overview of all the components you need to be able to understand your system behavior and achieve observability.
In figure 11.1, we display the components of a monitoring stack:
Each of these components feeds into its own dashboards as an aggregation of data from multiple services. This allows you to set up automated alerts and look into all the collected data to investigate any issues or better understand system behavior. Metrics will enable monitoring, whereas logs and traces will enable observability.
In chapter 3, we discussed the architecture tiers: client, boundary, services, and platform. You should implement monitoring in all of these layers, because you can’t determine the behavior of a given component in total isolation. A network issue will most likely affect a service. If you collect metrics at the service level, the only thing you’ll be able to know is that the service itself isn’t serving requests. That alone tells you nothing about the cause of the issue. If you also collect metrics at the infrastructure level, you can understand problems that’ll most likely affect multiple other components.
In figure 11.2, you can see the services that work together to allow a client to place an order for selling or buying shares. Multiple services are involved. Some communication between services is synchronous, either via RPC or HTTP, and some is asynchronous, using an event queue. To be able to understand how services are performing, you need to be able to collect multiple data points to monitor and either diagnose issues or prevent them before they even arise.
Monitoring individual services will be of little to no use because services provide isolation but don’t exist isolated from the outside world. Services often depend on each other and on the underlying infrastructure (for example, the network, databases, cache stores, and event queues). You can get a lot of valuable information by monitoring services, but you need more. You need to understand what’s going on in all your layers.
Your monitoring solution should allow you to know what is broken or degrading and why. You’ll be able to quickly reveal any symptoms and use the available monitors to determine causes.
Referring to figure 11.2, it’s worth mentioning that symptoms and causes vary depending on the observation point. If the market service is having issues communicating with the stock exchange, you can diagnose that by measuring response times or HTTP status codes for that interaction. In that situation, you can be almost sure that the place order feature won’t be working as expected.
But what if you have an issue with connectivity from services to the event queue? Services won’t be publishing messages, so downstream services won’t be consuming them. In that situation, no service is failing because no service is performing any work. If you have proper monitoring in place, it can alert you to the abnormal decrease in throughput. You can set your monitoring solution to send you automated notifications when the number of messages in a given queue goes below a certain threshold.
Lack of messages isn’t the only thing that can indicate issues, though. What if you have messages accumulating in a given queue? Such accumulation may indicate the services that consume messages from the queue are either not working properly or are having trouble keeping up with increased demand. Monitoring allows you to identify issues or even predict load increases and act accordingly to maintain service quality. Let’s take some time for you to learn a bit more about the signals you should collect.
You should focus on four golden signals while collecting metrics from any user-facing system: latency, errors, traffic, and saturation.
Latency measures how much time passes between when you make a request to a given service and when the service completes the request. You can determine a lot from this signal. For example, you can infer that the service is degrading if it shows increasing latency. You need to take extra care, though, in correlating this signal with errors. Imagine the application responds to a request quickly but with an error. Latency has a low value in this case, but the outcome isn’t the desired one. It’s important to keep the latency of requests that result in errors out of this equation, because it can be misleading.
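To make that last point concrete, here’s a small sketch (a hypothetical helper, not from the book’s codebase) that averages latency over successful requests only, so a burst of fast error responses can’t mask a real degradation:

```python
from statistics import mean

def mean_success_latency(requests):
    """Average latency over successful requests only.

    `requests` is a list of (latency_ms, http_status) pairs; failed
    requests are excluded because fast error responses would drag the
    average down and hide a real slowdown.
    """
    successes = [latency for latency, status in requests if status < 400]
    return mean(successes) if successes else None
```

For example, two 100–120 ms successful requests plus a 5 ms HTTP 500 should report roughly 110 ms, not 75 ms.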
This signal determines the number of requests that don’t result in a successful outcome. The errors may be explicit or implicit — for example, having an HTTP 500 error versus having an HTTP 200 but with the wrong content. The latter isn’t trivial to monitor for because you can’t rely solely on the HTTP codes, and you may only be able to determine the error by finding wrong content in other components. You generally catch these errors with end-to-end or contract tests.
This signal measures the demand placed on a system. It can vary depending on the type of system being observed, the number of requests per second, network I/O, and so on.
At a given point, this measures the capacity of the service. It mainly applies to resources that tend to be more constrained, like CPU, memory, and network.
While collecting metrics, you need to determine the type that’s best suited for a given resource you’re aiming to monitor.
A counter is a cumulative metric representing a single numerical value that only ever increases. Examples of metrics suited to counters are:
You shouldn’t use a counter if the metric it represents can also decrease. For that, you should use a gauge instead.
Gauges are metrics representing single numerical arbitrary values that can go up or down. Some examples of metrics using gauges are:
You use histograms to sample observations and categorize them in configurable buckets per type, time, and so on. Examples of metrics represented by histograms are:
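As a rough illustration of the semantics of the three types, here is a toy sketch (not a real metrics client; in practice you’d use a StatsD or Prometheus client library):

```python
class Counter:
    """Cumulative value that can only increase (it resets only on restart)."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can't decrease; use a gauge instead")
        self.value += amount


class Gauge:
    """Arbitrary value that can go up or down, like queue depth or memory use."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Samples observations into configurable buckets, e.g. response times."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.buckets = sorted(buckets)
        self.counts = dict.fromkeys(self.buckets, 0)
        self.overflow = 0  # observations larger than the biggest bucket

    def observe(self, value):
        for bucket in self.buckets:
            if value <= bucket:
                self.counts[bucket] += 1
                return
        self.overflow += 1
```

Note that real Prometheus histograms use cumulative buckets; this sketch only conveys the idea of bucketed observations.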
As we already mentioned, you should make sure you instrument as much as possible to collect as much data as you can about your services and infrastructure. You can use the collected data at later stages once you devise new ways to correlate and expose it. You can’t go back in time to collect data, but you can make data available that you previously collected.
Keep in mind that you should go about representing that data, showing it in dashboards, and setting up alerts in a progression to avoid having too much information at once that will be hard to reason through. There is no point in throwing every single collected metric for a service into one dashboard. You can create several dashboards per service with detailed views, but keep one top-level dashboard with the most important information. This dashboard should allow you, at a glance, to determine if a service is operating properly. It should give a high-level view of the service, and any more in-depth information should appear in more specialized dashboards.
When representing metrics, you should focus on the most important ones, like response times, errors, and traffic. These will be the foundation of your observability capabilities. You also should focus on the right percentiles for each use case: 99th, 95th, 75th, and so on. For a given service, it may be good enough if only 95% of your requests take less than x seconds, whereas on another service you may require 99% of the requests to be below that time. There is no fixed rule for which percentile to focus on — that generally depends on the business requirements.
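Percentile values themselves are simple to compute from raw samples. This sketch uses the nearest-rank method (an illustrative helper; your monitoring tool may use a different estimation):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that at least p% of
    all samples are less than or equal to."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

With response times of 1 through 100 ms, `percentile(times, 95)` is 95: only 5% of requests were slower than that value.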
Whenever possible, you should use tags to provide context to your metrics. Examples of tags to associate with metrics are:
By tagging metrics, you can group them later on and perhaps come up with some more insights. Take, for example, a response time you’ve tagged with the User ID; you can group the values by user and determine if all of the user base or only a particular group of users experiences an increase in response times.
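Grouping by tag is then a straightforward aggregation. The sketch below (with hypothetical sample data) averages response times per user_id, which would reveal whether a slowdown affects the whole user base or only one group:

```python
from collections import defaultdict
from statistics import mean

def mean_by_tag(samples, tag):
    """Group (tags, value) samples by one tag and average each group."""
    groups = defaultdict(list)
    for tags, value in samples:
        groups[tags.get(tag, "untagged")].append(value)
    return {key: mean(values) for key, values in groups.items()}

# Hypothetical response-time samples tagged with user and region
samples = [
    ({"user_id": "u1", "region": "eu"}, 120),
    ({"user_id": "u1", "region": "eu"}, 80),
    ({"user_id": "u2", "region": "us"}, 900),  # one user seeing slow responses
]
```

Grouping the same samples by the `region` tag instead would tell you whether the slowdown is geographic.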
Make sure you always abide by some defined standards when you’re naming metrics. It’s important that you maintain a naming scheme across services. One possible way of naming metrics is to use the service name, the method, and the type of metric you wish to collect. Here are some examples:
orders_service.sell_shares.count
orders_service.sell_shares.success
fees_service.charge_fee.failure
account_transactions_service.request_reservation.max
gateway.sell_shares.avg
market_service.place_order.95percentile
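A trivial helper can enforce such a scheme so names stay consistent across services (a sketch; the service and method names are those from the examples above):

```python
def metric_name(service, method, kind):
    """Build a metric name following the <service>.<method>.<kind> scheme."""
    for part in (service, method, kind):
        if not part or "." in part:
            raise ValueError(f"invalid metric name part: {part!r}")
    return f"{service}.{method}.{kind}"
```

Centralizing the naming in one function (or a shared library) keeps individual teams from drifting into incompatible schemes.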
You need to send the metrics you collect from your services and infrastructure to a system capable of aggregating and displaying them. The system will use those collected metrics to provide alerting capabilities. For that purpose, you’ll be using Prometheus to collect metrics and Grafana to display them:
You’ll do all your setup using Docker. In chapter 7, you already added to your services the ability to emit metrics via StatsD. You’ll keep those services unchanged and add something to your setup to convert metrics from StatsD format to the format that Prometheus uses. You’ll also add a RabbitMQ container that’s already set up to send metrics to Prometheus. Figure 11.3 shows the components you’ll be adding to set up your monitoring system.
You’ll be using both Prometheus and StatsD metrics as a way to show how two types of metrics collection protocols can coexist. StatsD is a push-based tool, whereas Prometheus is a pull-based tool. Systems using StatsD will be pushing data to a collector service, whereas Prometheus will pull that data from the emitting systems.
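To make the push model concrete, here’s what emitting a StatsD timer looks like at the wire level: a single UDP datagram in the `name:value|ms` format. This is a minimal sketch; the services from chapter 7 use a StatsD client library rather than raw sockets:

```python
import socket

def push_timer(name, value_ms, host="localhost", port=8125):
    """Fire-and-forget push of a StatsD timer metric over UDP."""
    payload = f"{name}:{value_ms}|ms".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# e.g. push_timer("simplebank-demo.orders.sell_shares", 42)
```

Port 8125 matches the statsd container in the compose file. Because UDP is fire-and-forget, instrumentation never blocks the service, at the cost of possibly dropping a sample under load.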
You’ll start by adding the services described in figure 11.2 to the Docker compose file, then you’ll focus on configuring both the StatsD exporter and Prometheus. The last step will be to create the dashboards in Grafana and start monitoring the services and the event queue. All the code is available in the book’s code repository.
The Docker compose file (see the next listing) will allow you to boot all the services and infrastructure needed for the place order feature. For the sake of brevity, we’ll omit the individual services and will only list the infrastructure- and monitoring-related containers.
Listing 11.1 docker-compose.yml file
(…)
rabbitmq: ①
container_name: simplebank-rabbitmq
image: deadtrickster/rabbitmq_prometheus
ports:
- "5673:5672"
- "15673:15672"
redis:
container_name: simplebank-redis
image: redis
ports:
- "6380:6379"
statsd_exporter: ②
image: prom/statsd-exporter
command: "-statsd.mapping-config=/tmp/
➥statsd_mapping.conf" ③
ports:
- "9102:9102"
- "9125:9125/udp"
volumes:
- "./metrics/statsd_mapping.conf:/tmp/statsd_mapping.conf"
prometheus: ④
image: prom/prometheus
command: "--config.file=/tmp/prometheus.yml
➥--web.listen-address '0.0.0.0:9090'" ⑤
ports:
- "9090:9090"
volumes:
- "./metrics/prometheus.yml:/tmp/prometheus.yml"
statsd: ⑥
image: dockerana/statsd
ports:
- "8125:8125/udp"
- "8126:8126"
volumes:
- "./metrics/statsd_config.js:/src/statsd/
➥config.js" ⑦
grafana: ⑧
image: grafana/grafana
ports:
- "3900:3000"
As we mentioned before, the services involved in the place order feature emit metrics in the StatsD format. In table 11.1, we list all the services and the metrics each one emits. The services will all be emitting timer metrics.
Service | Metrics
Account transactions | request_reservation
Fees | charge_fee
Gateway | health, sell_shares
Market | request_reservation, place_order_stock_exchange
Orders | sell_shares, request_reservation, place_order
The mapping config file allows you to configure each metric that StatsD collects and add labels to it. The following listing provides the mapping you’ll create as a configuration file for the statsd-exporter container.
Listing 11.2 Configuration file to map StatsD metrics to Prometheus
simplebank-demo.account-transactions.request_reservation ①
name="request_reservation" ②
app="account-transactions" ③
job="simplebank-demo" ④
simplebank-demo.fees.charge_fee ⑤
name="charge_fee"
app="fees"
job="simplebank-demo"
simplebank-demo.gateway.health ⑥
name="health"
app="gateway"
job="simplebank-demo"
simplebank-demo.gateway.sell_shares
name="sell_shares"
app="gateway"
job="simplebank-demo" ⑥
simplebank-demo.market.request_reservation ⑦
name="request_reservation"
app="market"
job="simplebank-demo"
simplebank-demo.market.place_order_stock_exchange
name="place_order_stock_exchange"
app="market"
job="simplebank-demo" ⑦
simplebank-demo.orders.sell_shares ⑧
name="sell_shares"
app="orders"
job="simplebank-demo"
simplebank-demo.orders.request_reservation
name="request_reservation"
app="orders"
job="simplebank-demo"
simplebank-demo.orders.place_order
name="place_order"
app="orders"
job="simplebank-demo" ⑧
If you don’t map these metrics to Prometheus, they’ll still be collected, but in a less convenient form. Figure 11.4 shows the difference between mapped and unmapped metrics fetched from Prometheus via the statsd_exporter service.
As you can observe in figure 11.4, when the unmapped create_event metrics that both the market and orders services emit reach Prometheus, they’re collected as:
simplebank_demo_market_create_event_timer
simplebank_demo_orders_create_event_timer
For the request_reservation_timer metric that the market, orders, and account transactions services emit, there’s only one entry; the metric is the same, and the differentiation is in the metadata:
request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.5"} ①
request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.9"}
request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.99"}
simplebank_demo_market_create_event_timer{exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.5"} ②
Now that you’ve configured the StatsD exporter, it’s time to configure Prometheus for it to fetch data from both the StatsD exporter and RabbitMQ, as shown in the following listing. Both of these sources will be available as targets for metrics data fetching.
Listing 11.3 Prometheus configuration file
global:
scrape_interval: 5s ①
evaluation_interval: 10s
external_labels:
monitor: 'simplebank-demo'
alerting:
alertmanagers:
- static_configs:
- targets:
scrape_configs: ②
- job_name: 'statsd_exporter' ③
static_configs:
- targets: ['statsd-exporter:9102']
labels:
exporter: 'statsd'
metrics_path: '/metrics'
- job_name: 'rabbitmq'
static_configs:
- targets: ['rabbitmq:15672'] ④
labels:
exporter: 'rabbitmq'
metrics_path: '/api/metrics' ④
To receive metrics in Grafana, you need to set up a data source. First, you can boot your applications and infrastructure by using the Docker compose file. This will allow you to access Grafana on port 3900, as follows.
Listing 11.4 Grafana setup in the docker-compose.yml file
(...)
grafana:
image: grafana/grafana ①
ports:
- "3900:3000" ②
To start all applications and services using Docker compose, you need to get inside the folder containing the compose file and issue the up command:
chapter-11$ docker stop $(docker ps | grep simplebank |
➥awk '{print $1}') ①
chapter-11$ docker rm $(docker ps -a | grep simplebank |
➥awk '{print $1}') ②
chapter-11$ docker-compose up --build --remove-orphans ③
Starting simplebank-redis ...
Starting chapter11_statsd-exporter_1 ...
Starting chapter11_statsd_1 ...
Starting simplebank-rabbitmq ...
Starting chapter11_prometheus_1 ...
Starting simplebank-rabbitmq ... done
Starting simplebank-gateway ...
Starting simplebank-fees ...
Starting simplebank-orders ...
Starting simplebank-market
Starting simplebank-account-transactions ... done
Attaching to chapter11_prometheus_1, simplebank-redis, chapter11_statsd_1, simplebank-rabbitmq, chapter11_statsd-exporter_1, simplebank-gateway, simplebank-fees, simplebank-orders, simplebank-market, simplebank-account-transactions
(…)
The output of the docker-compose up command will allow you to understand when all services and applications are ready. You can reach applications using the URL assigned to Docker or the IP address. By appending port 3900 as configured in the docker-compose.yml file, you can access Grafana’s login screen as shown in figure 11.5. You’ll be accessing Grafana using the default login credentials: username and password are both admin.
Once you log in, you’ll have an Add Data Source option. Figure 11.6 shows the data source configuration screen, Edit Data Source. To configure a Prometheus data source in Grafana, you need to select Prometheus as the type and insert the URL of the running Prometheus instance, in your case http://prometheus:9090, as configured in the Docker compose file.
The Save & Test button will give you instant feedback on the data source status. Once it’s working, you’re ready to use Grafana to build dashboards for your collected metrics. In the next few sections, you’ll be using it to display metrics both for the services that enable the place orders functionality in SimpleBank and for monitoring a critical piece of the infrastructure, RabbitMQ, the event queue.
To set up the dashboard to monitor RabbitMQ, you’ll be using a json configuration file. This is a convenient and easy way to share dashboards. In the source code repository, you’ll find a grafana folder. Inside that, a RabbitMQ Metrics.json file holds the configuration for both the dashboard layout and the metrics you want to collect. You can now import that file to have your RabbitMQ monitoring dashboard up in no time!
Figure 11.7 shows how you can access the import dashboard functionality in Grafana. By clicking Grafana’s logo, you bring up a menu; if you hover over Dashboards, the Import option will be available.
The import option will bring up a dialog box that enables you to either paste the json in a text box or upload a file. Before you can use the imported dashboard, you need to configure the data source that’ll feed the dashboard. In this case, you’ll be using the SimpleBank data source you configured previously.
That’s all it takes to have your RabbitMQ dashboard up and running. In figure 11.8 you can see how it looks.
Your RabbitMQ dashboard provides you an overview of the system by displaying a monitor for the server status that shows if it’s up or down, along with graphs for Exchanges, Channels, Consumers, Connections, Queues, Messages per Host, and Messages per Queue. You can hover over any graph to display details for metrics at a point in time. Clicking the graph’s title will bring up a context menu that allows you to edit, view, duplicate, or delete it.
Now that you have services up and running, along with the monitoring infrastructure, Prometheus and Grafana, it’s time to collect the metrics described in table 11.1. You can start by loading a dashboard exported as json that you can find in the source directory under the grafana folder (Place Order.json). Follow the same instructions as the ones in 11.2.2 for the RabbitMQ dashboard.
Figure 11.9 displays the dashboard collecting metrics for the services involved in the place order feature. By clicking on each of the panel titles, you can view, edit, duplicate, share, and delete each of the panels.
This loaded dashboard collects the time metrics and displays the 0.5, 0.9, and 0.99 quantiles for each metric. In the top right corner, you find the manual refresh button as well as the period for displaying metrics. By clicking the Last 5 Minutes label, you can select another period for displaying metrics, as shown in figure 11.10. You can select one of the Quick Ranges values or create a custom one, and you can display stored metrics in any range you need.
Let’s focus on the Market | Place Order Stock Exchange panel to see in detail how you can configure a specific metric display. To do so, click the panel title and then select the Edit option. Figure 11.11 shows the edit screen for the Market | Place Order Stock Exchange.
The edit screen has a set of tabs (1) you can select to configure different options. The highlighted one is the Metrics tab, where you can add and edit the metrics to be displayed. In this particular case, you’re only collecting one metric (2), place_order_stock_exchange_timer, which gives you the time it took for the market service to place an order into the stock exchange. The default display for a metric contains metadata like the app name, the exported job, and the quantile. To change the way the legend is presented, you set a Legend Format (3). In this case, you set the name and use a {{quantile}} block that’ll be interpolated to display the quantile in both the graph legend and the hovering window next to the vertical red line. (The red line acts as a cursor when you move your mouse across the collected metrics.) In your dashboards, you’re displaying the min, max, avg, and current values for each quantile (4).
The dashboard you’ve set up is quite simple, but it allows you to have an overview of how the system is behaving. You’re able to collect time-related metrics for several actions that services in your system perform.
Now that you’re collecting metrics and storing them, you can set up alerts for when values deviate from what you consider as normal for a given metric. This can be an increase in the time taken to process a given request, an increase in the percent of errors, an abnormal variation in a counter, and so on.
In your case, you can consider the market service and set up an alert to know when the service needs to be scaled. Once you place a sell order via the gateway service, a lot goes on. Multiple events are fired, and you know the bottleneck tends to be the market service processing the place order event. The good thing is you can set up an alert to send a message whenever the number of messages in the market place order queue goes above a certain threshold. You can configure multiple channels for notifications: email, Slack, PagerDuty, Pingdom, webhooks, and so on.
You’ll be setting up a webhook notification to receive a message in your alert server every time the number of messages goes above 100 in any message queue. For now, you’ll only be receiving it in an alert service built purely to illustrate the feature. But you could easily change this service to trigger an increase in the number of instances of a given service to increase the capacity to process messages from a queue.
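Such an alert service can be tiny. The sketch below is a hypothetical stand-in for the book’s alert service, using only the Python standard library: it accepts Grafana’s webhook POSTs and stores the parsed payloads, where a real version might call a scaling API instead:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    """Minimal webhook receiver: records incoming Grafana alert payloads."""
    received = []  # parsed webhook payloads, newest last

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        AlertHandler.received.append(json.loads(body))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# To run it: HTTPServer(("", 8080), AlertHandler).serve_forever()
```

The port and handler name are illustrative; the only contract that matters is accepting a JSON body via POST and answering 200.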
The alert service is a simple app that also booted when you started all the other apps and services. It’ll be listening for incoming POST requests, so you can go ahead and configure the alerts in Grafana. Figure 11.12 shows the activity for the market place order event queue with an indication of alerts, both when they were triggered and when the alert condition ceased. When you set up alerts, Grafana will indicate as an overlay both the threshold set for alerting (1) and the instants when alerts were triggered (2, 4, 6) and resolved (3, 5, 7).
With the current setup, the alert service sends an alert message as a webhook when the number of messages in any queue goes over 100. The following shows you one of those alert messages:
alerts.alert.d26ab4ca-1642-445f-a04c-41adf84145fd:
{
"evalMatches": [
{
"value":158.33333333333334, ①
"metric":"evt-orders_service-order_created
➥--market_service.place_order", ②
"tags":{
"__name__":"rabbitmq_queue_messages",
"exporter":"rabbitmq",
"instance":"rabbitmq:15672",
"job":"rabbitmq",
"queue":"evt-orders_service-order_created
➥--market_service.place_order", ③
"vhost":"/"
}
}
],
"message":"Messages accumulating in the queue",
"ruleId":1,
"ruleName":"High number of messages in a queue", ④
"ruleUrl":"http://localhost:3000/dashboard/db/rabbitmq-metrics?fullscreen
➥\u0026edit\u0026tab=alert\u0026panelId=2\u0026orgId=1",
"state":"alerting", ⑤
"title":"[Alerting] High number of messages in a queue"
}
Likewise, when the number of messages in a queue goes below the value defined as the threshold for alerting, the service also issues a message to notify about it:
alerts.alert.209f0d07-b36a-43f4-b97c-2663daa40410:
{
"evalMatches":[],
"message":"Messages accumulating in the queue",
"ruleId":1,
"ruleName":"High number of messages in a queue", ①
"ruleUrl":"http://localhost:3000/dashboard/db/rabbitmq-metrics?fullscreen
➥\u0026edit\u0026tab=alert\u0026panelId=2\u0026orgId=1",
"state":"ok", ②
"title":"[OK] High number of messages in a queue"
}
Let’s now see how you can set up this alert for the number of messages in queues. You’ll also be using Grafana for setting up the alert, because it offers this capability and the alerts will display on the panels they relate to. You’ll be able to both receive notifications and check the panels for previous alerts.
You’ll start by adding a notification channel that you’ll use to propagate alert events. Figure 11.13 shows how to create a new notification channel.
To set up a new notification channel in Grafana, follow these steps:
POST requests.

Now that you have an alert channel set up, you can go ahead and create alerts on your panels. You’ll be setting an alert on the messages queue panel under the RabbitMQ dashboard you created previously. Clicking the Messages/Queue panel title will bring up a menu where you can select Edit. This allows you to create a new alert under the Alert tab. Figure 11.14 shows how to set up a new alert.
Under the Alert Config screen, start by adding the Name for the alert as well as the frequency at which you want the condition to be evaluated — in this case every 30 seconds. The next step is to set the Conditions for the alert. You’ll be setting an alert to notify you whenever the average of the values collected from query A is above 100 in the last minute.
Under the Alert tab, you also can check the history of the configured alerts. Figure 11.15 shows the alert history for the number of messages in queues.
That’s it, you’re done! You’ve set up a monitoring infrastructure to collect both metrics that your services already emitted and those that come from a key component that those services use to communicate asynchronously: the event queue. You’ve also seen how to create alerts to be notified whenever certain conditions in your system are met. Let’s now dig a bit deeper into alerts and how to use them.
Having a monitoring infrastructure in place means you can measure system performance and keep a historical record of those measurements. It also means you can define thresholds for your measurements and automatically emit notifications when those thresholds are exceeded.
One thing you need to keep in mind, though, is that it’s easy to reach a stage where all this information can become overwhelming. Eventually, the overload of information can do more harm than good (for example, if it gets so bad that people start ignoring recurring alerts). You need to make sure that the alerts you raise are actionable — and actioned — and that they’re targeting the correct people in the organization.
Although services may consume and take action on some alerts automatically (for example, autoscaling a service if messages are accumulating in a queue), humans need to consume and take action on others. You want those alerts to reach the correct people and contain enough information so that diagnosing the cause becomes as easy as possible.
You also need to prioritize alerts, because most likely any issue with your services or infrastructure will trigger multiple alerts. Whoever is dealing with those alerts needs to know immediately the urgency of each one. As a rule, you should direct alerts for services to the teams owning those services. You should map the application into the organization, because this helps with determining the targets for alerts.
In day-to-day operation, alerts should target the team that owns the service that generated them. This reflects the “you build it, you run it” mantra that should govern a microservices-oriented engineering team. As teams create and deploy services, it’s hard, if not impossible, for everyone to know about every service deployed. The people with the most knowledge about a service will be in the best position to interpret and take action in response to the alerts it generates.
Organizations also may have some on-call rotation or a dedicated team that’ll receive and monitor alerts and then escalate if necessary to specialized teams. When setting up alerts and notifications, it’s important to keep in mind that other people may consume them, so you should keep those alerts as concise and informative as possible. It’s also important that each service have some sort of documentation on common issues and diagnosing recipes so that on-call teams can, when they receive an alert, determine if they can fix the issue or if they need to escalate it.
You also should categorize alerts by levels of urgency. Not every issue will need immediate attention, but some are deal breakers that you need to address as soon as you know about them.
Severe issues should trigger a notification to ensure someone, either an engineer from the team that built the service or an on-call engineer, is notified. Issues that are of moderate severity should generate alerts as notifications in any channels deemed appropriate, so those monitoring them can pick them up. You can think of this type of alert as something generating a queue of tasks that you need to carry out as soon as possible but not immediately — they don’t need to interrupt someone’s flow or wake someone up in the middle of the night. The lowest priority alerts are those that only generate a record. These alerts aren’t strictly for human consumption, because services can receive them and take some kind of action if needed (for example, autoscaling a service when response times increase).
Symptoms, not causes, should trigger alerts. An example of this is a user-facing error; if users can no longer access a service, that inability should generate an alert. You shouldn’t be tempted to trigger alerts for every single parameter that isn’t under the normal threshold. With such partial information, you won’t be able to know what’s going on or what the problem is. In figure 11.2, we illustrated the flow for placing orders in the stock market. Four services cooperate with a gateway that works as the access point for the consumer of the feature. One or more of the services may be exhibiting erroneous behavior or be overloaded. Given the mainly asynchronous nature of the communication between components, it may be hard to pinpoint why a given error may be happening.
Imagine you set an alert that relates the number of requests reaching the gateway and the number of issued notifications of orders placed. It’ll be simple to correlate those two metrics over time and determine the ratio between the two. You’ll have a symptom: the number of orders placed is greater than the ones completed. You can start from there and then try to understand which component is failing (maybe even multiple components). Is it the event queue or an infrastructure problem? Is the system under high loads and can’t cope? The symptom will be the starting point for your investigation, and from there you should follow the leads until you find the cause or causes.
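The symptom check itself can be as simple as a ratio over a time window. A sketch, where the 0.9 threshold is an arbitrary example value:

```python
def completion_ratio(placed, completed):
    """Completed-to-placed order ratio over the same window."""
    return 1.0 if placed == 0 else completed / placed

def should_alert(placed_series, completed_series, threshold=0.9):
    """Fire when significantly fewer orders complete than are placed.

    Each series is a list of per-interval counts over the window, e.g.
    one value per scrape interval.
    """
    return completion_ratio(sum(placed_series), sum(completed_series)) < threshold
```

Alerting on the ratio rather than on either raw count means the alert fires on the symptom (orders not completing) regardless of which component downstream is at fault.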
Correlating metrics can be a powerful tool for inferring and understanding more than the per-service state of the system. Monitoring can also help you understand and reason through the behavior of the system under different conditions, and this can help you predict and adjust your capacity by using all the collected data. The good thing about collecting per-service metrics is you can iteratively correlate them between different services and get an overall idea of the behavior of the whole application. In figure 11.16, you can see a possible correlation of different service metrics.
Let’s look into each of the suggested correlations:
Combining different metrics into new dashboards and setting sensible alerts on them allows you to gain insights into the overall application. It’s then up to you to determine the desired level of detail, from a high-level view to a detailed one.
So far, we’ve covered monitoring and alerting. You’ve set up a monitoring stack to be able to understand how things happened. You’re now able to understand the status of services, observe the metrics they emit, and determine if they’re operating within expected parameters. This is only part of the application observability effort. It’s a good starting point, but you do need more!
To be able to fully understand what’s going on, you need to invest some more in logging and tracing so you can have both a current view of what’s happening and a view of what happened before. In the next chapter, we’ll focus on logging and tracing as a complement to monitoring in your journey into observability. Doing so will help you to understand why things happened.