11
Building a monitoring system

This chapter covers

  • Understanding what signals to gather from running applications
  • Building a monitoring system to collect metrics
  • Learning how to use the collected signals to set up alerts
  • Observing the behavior of individual services and their interactions as a system

You’ve now set up an infrastructure to run your services and have deployed multiple components that you can combine to provide functionality to your users. In this chapter and the next, we’ll consider how you can make sure you always know how those components are interacting and how the infrastructure is behaving. It’s fundamental to know as early as possible when something isn’t behaving as expected. In this chapter, we’ll focus on building a monitoring system so you can collect relevant metrics, observe system behavior, and set up alerts that allow you to keep your systems running smoothly by taking action preemptively. When you can’t be preemptive, you’ll at least be able to quickly pinpoint the areas that need your attention so you can address any issues. It’s also worth mentioning that you should instrument as much as possible: data you don’t use today may turn out to be useful someday.

11.1 A robust monitoring stack

A robust monitoring stack will allow you to start gathering metrics from your services and infrastructure and use those metrics to gain insight into the operation of the system. It should provide a way to collect data and store, display, and analyze it.

You should start by emitting metrics from your services, even if you have no monitoring infrastructure in place. If you have those metrics stored, at any time you’ll be able to access, display, and interpret them. Observability is a continuous effort, and monitoring is a key element in that effort. Monitoring allows you to know whether a system is working, whereas observability lets you ask why it’s not working.

In this chapter, we’ll be focusing on monitoring, metrics, and alerts. We’ll explain logs and traces in chapter 12, and they’ll constitute the observability component.

Monitoring doesn’t only allow you to anticipate or react to issues; you can also use the metrics it collects to predict system behavior or to provide data for business analytics.

Multiple open source and commercial options are available for setting up a monitoring solution. Depending on the team size and resources available, you may find that a commercial solution is easier or more convenient to use. Nonetheless, in this chapter you’ll be using open source tools to build your own monitoring system. Your stack will be made up of a metrics collector and a display and alerting component. Logs and traces are also essential to achieve system observability. Figure 11.1 gives an overview of all the components you need to be able to understand your system behavior and achieve observability.

In figure 11.1, we display the components of a monitoring stack:

  • Metrics
  • Logs
  • Traces

Each of these components feeds into its own dashboards as an aggregation of data from multiple services. This allows you to set up automated alerts and look into all the collected data to investigate any issues or better understand system behavior. Metrics will enable monitoring, whereas logs and traces will enable observability.

11.1.1 Good monitoring is layered

In chapter 3, we discussed the architecture tiers: client, boundary, services, and platform. You should implement monitoring in all of these layers, because you can’t determine the behavior of a given component in total isolation. A network issue will most likely affect a service. If you collect metrics at the service level, the only thing you’ll be able to know is that the service itself isn’t serving requests. That alone tells you nothing about the cause of the issue. If you also collect metrics at the infrastructure level, you can understand problems that’ll most likely affect multiple other components.

In figure 11.2, you can see the services that work together to allow a client to place an order for selling or buying shares. Multiple services are involved. Some communication between services is synchronous, either via RPC or HTTP, and some is asynchronous, using an event queue. To be able to understand how services are performing, you need to be able to collect multiple data points to monitor and either diagnose issues or prevent them before they even arise.

Monitoring individual services will be of little to no use because services provide isolation but don’t exist isolated from the outside world. Services often depend on each other and on the underlying infrastructure (for example, the network, databases, cache stores, and event queues). You can get a lot of valuable information by monitoring services, but you need more. You need to understand what’s going on in all your layers.

c11_01.png

Figure 11.1 Components of a monitoring stack — metrics, traces, and logs — each aggregated in their own dashboards

c11_02.png

Figure 11.2 Services involved in placing orders and their communication protocols

Your monitoring solution should allow you to know what is broken or degrading and why. You’ll be able to quickly reveal any symptoms and use the available monitors to determine causes.

Referring to figure 11.2, it’s worth mentioning that symptoms and causes vary depending on the observation point. If the market service is having issues communicating with the stock exchange, you can diagnose that by measuring response times or HTTP status codes for that interaction. In that situation, you can be almost sure that the place order feature won’t be working as expected.

But what if you have an issue with connectivity from services to the event queue? Services won’t be publishing messages, so downstream services won’t be consuming them. In that situation, no service is failing because no service is performing any work. If you have proper monitoring in place, it can alert you to the abnormal decrease in throughput. You can set your monitoring solution to send you automated notifications when the number of messages in a given queue goes below a certain threshold.

Lack of messages isn’t the only thing that can indicate issues, though. What if you have messages accumulating in a given queue? Such accumulation may indicate the services that consume messages from the queue are either not working properly or are having trouble keeping up with increased demand. Monitoring allows you to identify issues or even predict load increases and act accordingly to maintain service quality. Let’s take some time for you to learn a bit more about the signals you should collect.

11.1.2 Golden signals

You should focus on four golden signals while collecting metrics from any user-facing system: latency, errors, traffic, and saturation.

Latency

Latency measures how much time passes between when you make a request to a given service and when the service completes the request. You can determine a lot from this signal. For example, you can infer that the service is degrading if it shows increasing latency. You need to take extra care, though, in correlating this signal with errors. Imagine you’re serving a request and the application responds quickly, but with an error. Latency has a low value in this case, but the outcome isn’t the desired one. It’s important to keep the latency of requests that result in errors out of this equation, because it can be misleading.
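
As an illustration of keeping failed requests out of the latency picture, the following sketch (using the Python prometheus_client library; the metric, label, and function names are hypothetical and not taken from SimpleBank) records latency under an outcome label so that dashboards can plot only successful requests:

import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "sell_shares_request_seconds",      # hypothetical metric name
    "Time spent handling sell_shares requests",
    ["outcome"],                        # label lets you exclude errors later
)

def process(request):
    ...                                 # stands in for the real order handling

def handle_sell_shares(request):
    start = time.time()
    try:
        response = process(request)
        REQUEST_LATENCY.labels(outcome="success").observe(time.time() - start)
        return response
    except Exception:
        # Failed requests get their own label, so a "fast but failing" service
        # doesn't show up as a healthy, low-latency one.
        REQUEST_LATENCY.labels(outcome="error").observe(time.time() - start)
        raise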

Errors

This signal determines the number of requests that don’t result in a successful outcome. The errors may be explicit or implicit — for example, having an HTTP 500 error versus having an HTTP 200 but with the wrong content. The latter isn’t trivial to monitor for because you can’t rely solely on the HTTP codes, and you may only be able to determine the error by finding wrong content in other components. You generally catch these errors with end-to-end or contract tests.

Traffic

This signal measures the demand placed on a system. How you measure it varies depending on the type of system being observed: the number of requests per second, network I/O, and so on.

Saturation

This signal measures how much of the service’s capacity is in use at a given point. It mainly applies to the resources that tend to be most constrained, like CPU, memory, and network.

11.1.3 Types of metrics

While collecting metrics, you need to determine the type that’s best suited for a given resource you’re aiming to monitor.

Counters

A counter is a cumulative metric representing a single numerical value that only ever increases. Examples of metrics using counters are:

  • Number of requests
  • Number of errors
  • Number of each HTTP code received
  • Bytes transmitted

You shouldn’t use a counter if the metric it represents can also decrease. For that, you should use a gauge instead.

Gauges

A gauge is a metric representing a single, arbitrary numerical value that can go up or down. Some examples of metrics using gauges are:

  • Number of connections to a database
  • Memory used
  • CPU used
  • Load average
  • Number of services operating abnormally

Histograms

You use histograms to sample observations and count them in configurable buckets (by duration, size, and so on). Examples of metrics represented by histograms are:

  • Latency of a request
  • I/O latency
  • Bytes per response
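
To make the three types concrete, here’s a brief sketch using the Python prometheus_client library (the SimpleBank services emit StatsD metrics instead, so treat the names here as illustrative only):

from prometheus_client import Counter, Gauge, Histogram

REQUESTS_TOTAL = Counter("requests_total", "Total requests received")    # only ever increases
DB_CONNECTIONS = Gauge("db_connections", "Open database connections")    # can go up or down
RESPONSE_BYTES = Histogram(
    "response_bytes", "Response size in bytes",
    buckets=(256, 1024, 4096, 16384, float("inf")),                      # configurable buckets
)

REQUESTS_TOTAL.inc()          # counter: increment on every request
DB_CONNECTIONS.set(12)        # gauge: set to the current value...
DB_CONNECTIONS.dec()          # ...or move it down when a connection closes
RESPONSE_BYTES.observe(2048)  # histogram: record a single observation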

11.1.4 Recommended practices

As we already mentioned, you should make sure you instrument as much as possible to collect as much data as you can about your services and infrastructure. You can use the collected data at later stages once you devise new ways to correlate and expose it. You can’t go back in time to collect data, but you can make data available that you previously collected.

Keep in mind that you should introduce data representations, dashboards, and alerts progressively, to avoid ending up with so much information at once that it becomes hard to reason through. There is no point in throwing every single collected metric for a service into one dashboard. You can create several dashboards per service with detailed views, but keep one top-level dashboard with the most important information. This dashboard should allow you to determine at a glance whether a service is operating properly. It should give a high-level view of the service, and any more in-depth information should appear in more specialized dashboards.

When representing metrics, you should focus on the most important ones, like response times, errors, and traffic. These will be the foundation of your observability capabilities. You also should focus on the right percentiles for each use case: 99th, 95th, 75th, and so on. For a given service, it may be good enough if 95% of your requests take less than x seconds, whereas for another service you may require 99% of the requests to be below that time. There is no fixed rule for which percentile to focus on; that generally depends on the business requirements.
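
When you want to inspect a given percentile programmatically rather than in a dashboard, you can query Prometheus over its HTTP API. The sketch below pulls the 99th-percentile series of a timer; the metric name and quantile label match the statsd_exporter output you’ll see later in this chapter, but the host and query are assumptions you’d adapt to your own setup:

import requests

PROMETHEUS_URL = "http://localhost:9090"   # port mapped in the compose file

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": 'request_reservation_timer{quantile="0.99"}'},
)
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _timestamp, value = series["value"]
    print(labels.get("app"), value)        # one line per emitting service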

Whenever possible, you should use tags to provide context to your metrics. Examples of tags to associate with metrics are:

  • Environment: Production, Staging, QA
  • User ID

By tagging metrics, you can group them later on and perhaps come up with some more insights. Take, for example, a response time you’ve tagged with the User ID; you can group the values by user and determine if all of the user base or only a particular group of users experiences an increase in response times.

Make sure you always abide by some defined standards when you’re naming metrics. It’s important that you maintain a naming scheme across services. One possible way of naming metrics is to use the service name, the method, and the type of metric you wish to collect. Here are some examples:

  • orders_service.sell_shares.count
  • orders_service.sell_shares.success
  • fees_service.charge_fee.failure
  • account_transactions_service.request_reservation.max
  • gateway.sell_shares.avg
  • market_service.place_order.95percentile
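
A small helper makes it easy to stick to the scheme; the function below is hypothetical (not part of SimpleBank) and simply concatenates the three parts:

def metric_name(service: str, method: str, metric_type: str) -> str:
    """Build a metric name following the service.method.type convention."""
    return f"{service}.{method}.{metric_type}"

assert metric_name("orders_service", "sell_shares", "count") == "orders_service.sell_shares.count"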

11.2 Monitoring SimpleBank with Prometheus and Grafana

You need to send the metrics you collect from your services and infrastructure to a system capable of aggregating and displaying them. The system will use those collected metrics to provide alerting capabilities. For that purpose, you’ll be using Prometheus to collect metrics and Grafana to display them:

  • Prometheus (https://github.com/prometheus) is an open source systems monitoring and alerting toolkit originally built at SoundCloud. It’s now a standalone open source project, maintained independently of any company.
  • Grafana (https://grafana.com) is a tool that allows building dashboards on top of multiple metrics data sources, such as Graphite, InfluxDB, and Prometheus.

You’ll do all your setup using Docker. In chapter 7, you already added to your services the ability to emit metrics via StatsD. You’ll keep those services unchanged and add a component to your setup that converts metrics from the StatsD format to the format Prometheus uses. You’ll also add a RabbitMQ container that’s already set up to expose metrics to Prometheus. Figure 11.3 shows the components you’ll be adding to set up your monitoring system.

c11_03.png

Figure 11.3 The containers you need to build your monitoring system: StatsD server, StatsD exporter, Prometheus, and Grafana

You’ll be using both Prometheus and StatsD metrics as a way to show how two types of metrics collection protocols can coexist. StatsD is a push-based tool, whereas Prometheus is a pull-based tool. Systems using StatsD will be pushing data to a collector service, whereas Prometheus will pull that data from the emitting systems.
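
The following sketch contrasts the two models in Python. With StatsD, the service pushes each measurement to a collector over UDP; with Prometheus, the service only exposes an HTTP endpoint and the server pulls (scrapes) it on its own schedule. Hostnames, ports, and metric names here are illustrative:

import socket
from prometheus_client import Counter, start_http_server

# Push model: fire-and-forget a StatsD counter increment over UDP.
statsd_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
statsd_socket.sendto(b"simplebank-demo.orders.sell_shares:1|c", ("statsd", 8125))

# Pull model: expose /metrics and let Prometheus scrape it when it chooses to.
SELL_SHARES = Counter("sell_shares_total", "Sell share requests handled")
SELL_SHARES.inc()
start_http_server(8000)   # Prometheus would be configured to scrape port 8000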

11.2.1 Setting up your metric collection infrastructure

You’ll start by adding the services described in figure 11.3 to the Docker compose file; then you’ll focus on configuring both the StatsD exporter and Prometheus. The last step will be to create the dashboards in Grafana and start monitoring the services and the event queue. All the code is available in the book’s code repository.

Adding components to the Docker compose file

The Docker compose file (see the next listing) will allow you to boot all the services and infrastructure needed for the place order feature. For the sake of brevity, we’ll omit the individual services and only list the infrastructure- and monitoring-related containers.

Listing 11.1 docker-compose.yml file

(…)
  rabbitmq:    ①  
    container_name: simplebank-rabbitmq
    image: deadtrickster/rabbitmq_prometheus
    ports:
      - "5673:5672"
      - "15673:15672"

  redis:
    container_name: simplebank-redis
    image: redis
    ports:
      - "6380:6379"
  statsd_exporter:    ②  
    image: prom/statsd-exporter
    command: "-statsd.mapping-config=/tmp/
➥statsd_mapping.conf"    ③  
    ports:
      - "9102:9102"
      - "9125:9125/udp"
    volumes:
      - "./metrics/statsd_mapping.conf:/tmp/statsd_mapping.conf"
  prometheus:    ④  
    image: prom/prometheus
    command: "--config.file=/tmp/prometheus.yml 
➥--web.listen-address '0.0.0.0:9090'"    ⑤  
    ports:
      - "9090:9090"
    volumes:
      - "./metrics/prometheus.yml:/tmp/prometheus.yml"
  statsd:    ⑥  
    image: dockerana/statsd
    ports:
      - "8125:8125/udp"
      - "8126:8126"
    volumes:
      - "./metrics/statsd_config.js:/src/statsd/
➥config.js"    ⑦  
  grafana:    ⑧  
    image: grafana/grafana
    ports:
      - "3900:3000"

Configuring StatsD exporter

As we mentioned before, the services involved in the place order feature emit metrics in the StatsD format. In table 11.1, we list all the services and the metrics each one emits. The services will all be emitting timer metrics.

Table 11.1 Timer metrics emitted by the services involved in placing an order
Service                 Metrics
Account transactions    request_reservation
Fees                    charge_fee
Gateway                 health, sell_shares
Market                  request_reservation, place_order_stock_exchange
Orders                  sell_shares, request_reservation, place_order
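
As a reminder of what emitting one of these timers looks like, here’s a hedged sketch using the Python statsd client; the actual SimpleBank services were instrumented back in chapter 7, so treat the wiring below as illustrative rather than the book’s exact code:

from statsd import StatsClient

statsd = StatsClient(host="statsd", port=8125, prefix="simplebank-demo.orders")

@statsd.timer("sell_shares")   # emitted as simplebank-demo.orders.sell_shares
def sell_shares(order):
    ...                        # place the sell order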

The mapping config file allows you to configure each metric that StatsD collects and add labels to it. The following listing provides the mapping you’ll create as a configuration file for the statsd-exporter container.

Listing 11.2 Configuration file to map StatsD metrics to Prometheus

simplebank-demo.account-transactions.request_reservation    ①  
name="request_reservation"    ②  
app="account-transactions"    ③  
job="simplebank-demo"    ④  

simplebank-demo.fees.charge_fee    ⑤  
name="charge_fee"
app="fees"
job="simplebank-demo"

simplebank-demo.gateway.health    ⑥  
name="health"
app="gateway"
job="simplebank-demo"

simplebank-demo.gateway.sell_shares
name="sell_shares"
app="gateway"
job="simplebank-demo"    ⑥  

simplebank-demo.market.request_reservation    ⑦  
name="request_reservation"
app="market"
job="simplebank-demo"

simplebank-demo.market.place_order_stock_exchange
name="place_order_stock_exchange"
app="market"
job="simplebank-demo"    ⑦  

simplebank-demo.orders.sell_shares    ⑧  
name="sell_shares"
app="orders"
job="simplebank-demo"

simplebank-demo.orders.request_reservation
name="request_reservation"
app="orders"
job="simplebank-demo"

simplebank-demo.orders.place_order
name="place_order"
app="orders"
job="simplebank-demo"    ⑧  

If you don’t map these metrics to Prometheus, they’ll still be collected, but in a less convenient form. In the figure 11.4 example, you can see the difference between mapped and unmapped metrics that Prometheus fetched from the statsd_exporter service.

c11_04.png

Figure 11.4 Prometheus screenshot with collected SimpleBank metrics — The top two metrics aren’t mapped in the statsd_mapping.conf file, whereas the last one is.

As you can observe in figure 11.4, when the unmapped create_event metrics that both the market and orders service emit reach Prometheus, they’re collected as:

  • simplebank_demo_market_create_event_timer
  • simplebank_demo_orders_create_event_timer

For the request_reservation_timer metric that the market, orders, and account transactions services emit, there’s only one entry: the metric name is the same, and the differentiation is in the metadata:

request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.5"}    ①  

request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.9"}
request_reservation_timer{app="*",exported_job="simplebank-demo",exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.99"}
simplebank_demo_market_create_event_timer{exporter="statsd",instance="statsd-exporter:9102",job="statsd_exporter",quantile="0.5"}    ②  

Configuring Prometheus

Now that you’ve configured the StatsD exporter, it’s time to configure Prometheus to fetch data from both the StatsD exporter and RabbitMQ, as shown in the following listing. Both of these sources will be configured as targets for metrics scraping.

Listing 11.3 Prometheus configuration file

global:
  scrape_interval:     5s    ①  
  evaluation_interval: 10s
  external_labels:
      monitor: 'simplebank-demo'

alerting:
  alertmanagers:
  - static_configs:
    - targets:
scrape_configs:    ②  
  - job_name: 'statsd_exporter'    ③  
    static_configs:
      - targets: ['statsd-exporter:9102']
        labels:
          exporter: 'statsd'
    metrics_path: '/metrics'

  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15672']    ④  
        labels:
          exporter: 'rabbitmq'
    metrics_path: '/api/metrics'    ④  

Setting up Grafana

To receive metrics in Grafana, you need to set up a data source. First, you can boot your applications and infrastructure by using the Docker compose file. This will allow you to access Grafana on port 3900, as follows.

Listing 11.4 Grafana setup in the docker-compose.yml file

  (...)
  grafana:
    image: grafana/grafana    ①  
    ports:
      - "3900:3000"    ②  

To start all applications and services using Docker compose, you need to get inside the folder containing the compose file and issue the up command:

chapter-11$ docker stop $(docker ps | grep simplebank | 
➥awk '{print $1}')  ①  
chapter-11$ docker rm $(docker ps -a | grep simplebank | 
➥awk '{print $1}')  ②  
chapter-11$ docker-compose up --build --remove-orphans    ③   

Starting simplebank-redis ...
Starting chapter11_statsd-exporter_1 ...
Starting chapter11_statsd_1 ...
Starting simplebank-rabbitmq ...
Starting chapter11_prometheus_1 ...
Starting simplebank-rabbitmq ... done
Starting simplebank-gateway ...
Starting simplebank-fees ...
Starting simplebank-orders ...
Starting simplebank-market
Starting simplebank-account-transactions ... done
Attaching to chapter11_prometheus_1, simplebank-redis, chapter11_statsd_1, simplebank-rabbitmq, chapter11_statsd-exporter_1, simplebank-gateway, simplebank-fees, simplebank-orders, simplebank-market, simplebank-account-transactions
(…)

The output of the docker-compose up command will allow you to understand when all services and applications are ready. You can reach applications using the URL assigned to Docker or the IP address. By appending port 3900 as configured in the docker-compose.yml file, you can access Grafana’s login screen as shown in figure 11.5. You’ll be accessing Grafana using the default login credentials: username and password are both admin.

c11_05.png

Figure 11.5 Grafana login screen

Once you log in, you’ll have an Add Data Source option. Figure 11.6 shows the data source configuration screen, Edit Data Source. To configure a Prometheus data source in Grafana, you need to select Prometheus as the type and insert the URL of the running Prometheus instance, in your case http://prometheus:9090, as configured in the Docker compose file.

The Save & Test button will give you instant feedback on the data source status. Once it’s working, you’re ready to use Grafana to build dashboards for your collected metrics. In the next few sections, you’ll be using it to display metrics both for the services that enable the place orders functionality in SimpleBank and for monitoring a critical piece of the infrastructure, RabbitMQ, the event queue.
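
If you’d rather script this step than click through the UI, Grafana also exposes an HTTP API for creating data sources. The following is only a sketch, assuming the default admin/admin credentials and the ports used in this chapter:

import requests

GRAFANA_URL = "http://localhost:3900"       # host port mapped in docker-compose.yml

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    auth=("admin", "admin"),
    json={
        "name": "SimpleBank",
        "type": "prometheus",
        "url": "http://prometheus:9090",    # resolved inside the Docker network
        "access": "proxy",                  # Grafana's backend proxies the queries
    },
)
print(resp.status_code, resp.json())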

11.2.2 Collecting infrastructure metrics — RabbitMQ

To set up the dashboard to monitor RabbitMQ, you’ll be using a json configuration file. This is a convenient and easy way to share dashboards. In the source code repository, you’ll find a grafana folder. Inside that, a RabbitMQ Metrics.json file holds the configuration for both the dashboard layout and the metrics you want to collect. You can now import that file to have your RabbitMQ monitoring dashboard up in no time!

c11_06.png

Figure 11.6 Configuring a Prometheus data source in Grafana

Figure 11.7 shows how you can access the import dashboard functionality in Grafana. By clicking Grafana’s logo, you bring up a menu; if you hover over Dashboards, the Import option will be available.

The import option will bring up a dialog box that enables you to either paste the json in a text box or upload a file. Before you can use the imported dashboard, you need to configure the data source that’ll feed the dashboard. In this case, you’ll be using the SimpleBank data source you configured previously.
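
The same import can be scripted against Grafana’s dashboard API. This is a sketch, again assuming default credentials; depending on how the json was exported, you may still need to map the data source placeholder to the SimpleBank source, just as the import dialog asks you to:

import json
import requests

with open("grafana/RabbitMQ Metrics.json") as f:
    dashboard = json.load(f)

resp = requests.post(
    "http://localhost:3900/api/dashboards/db",
    auth=("admin", "admin"),
    json={"dashboard": dashboard, "overwrite": True},
)
print(resp.status_code, resp.json())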

That’s all it takes to have your RabbitMQ dashboard up and running. In figure 11.8 you can see how it looks.

c11_07.png

Figure 11.7 Importing a dashboard from a json file

c11_08.png

Figure 11.8 RabbitMQ metrics collected via Prometheus and displayed in Grafana

Your RabbitMQ dashboard provides you an overview of the system by displaying a monitor for the server status that shows if it’s up or down, along with graphs for Exchanges, Channels, Consumers, Connections, Queues, Messages per Host, and Messages per Queue. You can hover over any graph to display details for metrics at a point in time. Clicking the graph’s title will bring up a context menu that allows you to edit, view, duplicate, or delete it.

11.2.3 Instrumenting SimpleBank’s place order

Now that you have the services up and running, along with the monitoring infrastructure, Prometheus and Grafana, it’s time to collect the metrics described in table 11.1. You can start by loading a dashboard exported as json that you can find in the source directory under the grafana folder (Place Order.json). Follow the same instructions as in section 11.2.2 for the RabbitMQ dashboard.

Figure 11.9 displays the dashboard collecting metrics for the services involved in the place order feature. By clicking on each of the panel titles, you can view, edit, duplicate, share, and delete each of the panels.

This loaded dashboard collects the time metrics and displays the 0.5, 0.9, and 0.99 quantiles for each metric. In the top right corner, you find the manual refresh button as well as the period for displaying metrics. By clicking the Last 5 Minutes label, you can select another period for displaying metrics, as shown in figure 11.10. You can select one of the Quick Ranges values or create a custom one, and you can display stored metrics in any range you need.

c11_09.png

Figure 11.9 Place order dashboard accessible at Grafana’s /dashboard/db/place-order endpoint

c11_10.png

Figure 11.10 Selecting the time range for which you want metrics to be displayed

Let’s focus on the Market | Place Order Stock Exchange panel to see in detail how you can configure a specific metric display. To do so, click the panel title and then select the Edit option. Figure 11.11 shows the edit screen for the Market | Place Order Stock Exchange.

The edit screen has a set of tabs (1) you can select to configure different options. The highlighted one is the Metrics tab, where you can add and edit the metrics to be displayed. In this particular case, you’re collecting only one metric (2), place_order_stock_exchange_timer, which gives you the time it took for the market service to place an order on the stock exchange. The default display for a metric contains metadata like the app name, the exported job, and the quantile. To change the way the legend is presented, you set a Legend Format (3). In this case, you set the name and use the {{quantile}} block, which will be interpolated to display the quantile in both the graph legend and the hover window next to the vertical red line. (The red line acts as a cursor when you move your mouse across the collected metrics.) In your dashboards, you’re displaying the min, max, avg, and current values for each quantile (4).

c11_11.png

Figure 11.11 Panel edit screen for Market | Place Order Stock Exchange

The dashboard you’ve set up is quite simple, but it allows you to have an overview of how the system is behaving. You’re able to collect time-related metrics for several actions that services in your system perform.

11.2.4 Setting up alerts

Now that you’re collecting and storing metrics, you can set up alerts for when values deviate from what you consider normal for a given metric. This could be an increase in the time taken to process a given request, an increase in the percentage of errors, an abnormal variation in a counter, and so on.

In your case, you can consider the market service and set up an alert to know when the service needs to be scaled. Once you place a sell order via the gateway service, a lot goes on. Multiple events are fired, and you know the bottleneck tends to be the market service processing the place order event. The good thing is that you can set up an alert to send a message whenever the number of messages in the market place order queue goes above a certain threshold. You can configure multiple channels for notifications: email, Slack, PagerDuty, Pingdom, webhooks, and so on.

You’ll be setting up a webhook notification to receive a message in your alert server every time the number of messages goes above 100 in any message queue. For now, you’ll only be receiving it in an alert service built to illustrate the feature, but you could easily change this service to spin up additional instances of a given service and increase the capacity to process messages from a queue.

The alert service is a simple app that also booted when you started all the other apps and services. It’ll be listening for incoming POST messages, so you can go ahead and configure the alerts in Grafana. Figure 11.12 shows the activity for the market place order event queue, annotated to show both when alerts were triggered and when the alert condition ceased. When you set up alerts, Grafana will indicate as an overlay both the threshold set for alerting (1) and the instants when alerts were triggered (2, 4, 6) and resolved (3, 5, 7).

With the current setup, the alert service sends an alert message as a webhook when the number of messages in any queue goes over 100. The following shows you one of those alert messages:

alerts.alert.d26ab4ca-1642-445f-a04c-41adf84145fd: 
{
  "evalMatches": [
    {
      "value":158.33333333333334,    ①  
      "metric":"evt-orders_service-order_created
➥--market_service.place_order",    ②  
      "tags":{
        "__name__":"rabbitmq_queue_messages",
        "exporter":"rabbitmq",
        "instance":"rabbitmq:15672",
        "job":"rabbitmq",
        "queue":"evt-orders_service-order_created
➥--market_service.place_order",    ③  
        "vhost":"/"
      }
    }
  ],
  "message":"Messages accumulating in the queue",
  "ruleId":1,
  "ruleName":"High number of messages in a queue",    ④  
  "ruleUrl":"http://localhost:3000/dashboard/db/rabbitmq-metrics?fullscreen
➥\u0026edit\u0026tab=alert\u0026panelId=2\u0026orgId=1",
  "state":"alerting",    ⑤  
  "title":"[Alerting] High number of messages in a queue"
}
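
A receiver for these payloads can be very small. The sketch below is in the spirit of the alert service bundled with the chapter’s code, but it’s a hypothetical Flask app (the route and port are assumptions); it only logs the rule name, state, and the queues that tripped the condition, which is exactly the place where you could hook in an automated reaction such as scaling up consumers:

from flask import Flask, request

app = Flask(__name__)

@app.route("/alerts", methods=["POST"])
def receive_alert():
    payload = request.get_json(force=True)
    queues = [match["tags"].get("queue") for match in payload.get("evalMatches", [])]
    print(f'{payload["state"]}: {payload["ruleName"]} {queues}')
    # React here if you want automation, for example scaling the consumers
    # of the affected queue.
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
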
c11_12.png

Figure 11.12 The message queue’s status showing alert overlays

Likewise, when the number of messages in a queue goes back below the threshold defined for alerting, you also receive a message notifying you that the alert is resolved:

alerts.alert.209f0d07-b36a-43f4-b97c-2663daa40410: 
{
  "evalMatches":[],
  "message":"Messages accumulating in the queue",
  "ruleId":1,
  "ruleName":"High number of messages in a queue",    ①  
  "ruleUrl":"http://localhost:3000/dashboard/db/rabbitmq-metrics?fullscreen
➥\u0026edit\u0026tab=alert\u0026panelId=2\u0026orgId=1",
  "state":"ok",  ②  
  "title":"[OK] High number of messages in a queue"
}

Let’s now see how you can set up this alert for the number of messages in queues. You’ll also be using Grafana for setting up the alert, because it offers this capability and the alerts will display on the panels they relate to. You’ll be able to both receive notifications and check the panels for previous alerts.

You’ll start by adding a notification channel that you’ll use to propagate alert events. Figure 11.13 shows how to create a new notification channel.

To set up a new notification channel in Grafana, follow these steps:

  1. Click the Grafana icon on the top left of the screen.
  2. Under the Alerting menu, select Notification Channels.
    c11_13.png

    Figure 11.13 Setting up a new notification channel in Grafana

  3. Enter the name for the channel and select the type as Webhook, then check the Send on All Alerts option.
  4. Enter the URL for the service receiving the alerts. In your case, you’ll be using the alerts service and listening for POST requests.
  5. Click the Send Test button to verify all is working, and if so, click Save to save the changes.

Now that you have an alert channel set up, you can go ahead and create alerts on your panels. You’ll be setting an alert on the messages queue panel under the RabbitMQ dashboard you created previously. Clicking the Messages/Queue panel title will bring up a menu where you can select Edit. This allows you to create a new alert under the Alert tab. Figure 11.14 shows how to set up a new alert.

Under the Alert Config screen, start by adding the Name for the alert as well as the frequency at which you want the condition to be evaluated — in this case every 30 seconds. The next step is to set the Conditions for the alert. You’ll be setting an alert to notify you whenever the average of the values collected from query A is above 100 in the last minute.

Under the Alert tab, you also can check the history of the configured alerts. Figure 11.15 shows the alert history for the number of messages in queues.

c11_14.png

Figure 11.14 Setting up alerts on the Messages/Queue graph on the RabbitMQ Dashboard

c11_15.png

Figure 11.15 Displaying the state history for a given alert

That’s it, you’re done! You’ve set up a monitoring infrastructure to collect both metrics that your services already emitted and those that come from a key component that those services use to communicate asynchronously: the event queue. You’ve also seen how to create alerts to be notified whenever certain conditions in your system are met. Let’s now dig a bit deeper into alerts and how to use them.

11.3 Raising sensible and actionable alerts

Having a monitoring infrastructure in place means you can measure system performance and keep a historic record of those measures. It also means you can determine thresholds for your measures and automatically emit notifications when those thresholds are exceeded.

One thing you need to keep in mind, though, is that it’s easy to reach a stage where all this information can become overwhelming. Eventually, the overload of information can do more harm than good (for example, if it gets so bad that people start ignoring recurring alerts). You need to make sure that the alerts you raise are actionable — and actioned — and that they’re targeting the correct people in the organization.

Although services may consume and take action on some alerts automatically (for example, autoscaling a service if messages are accumulating in a queue), humans need to consume and take action on others. You want those alerts to reach the correct people and contain enough information so that diagnosing the cause becomes as easy as possible.

You also need to prioritize alerts, because most likely any issue with your services or infrastructure will trigger multiple alerts. Whoever is dealing with those alerts needs to know immediately how urgent each one is. As a rule, you should direct alerts for a service to the team that owns it. You should map the application onto the organization, because this helps with determining the targets for alerts.

11.3.1 Who needs to know when something is wrong?

In day-to-day operation, alerts should target the team that owns the service that originated them. This reflects the “you build it, you run it” mantra that should govern a microservices-oriented engineering team. As teams create and deploy services, it’s hard, if not impossible, for everyone to know about every service deployed. The people with the most knowledge about a service will be in the best position to interpret and act on the alerts that the service generates.

Organizations also may have an on-call rotation or a dedicated team that receives and monitors alerts and then escalates to specialized teams if necessary. When setting up alerts and notifications, it’s important to keep in mind that other people may consume them, so you should keep those alerts as concise and informative as possible. It’s also important that each service have some documentation on common issues and diagnostic recipes so that on-call engineers can, when they receive an alert, determine whether they can fix the issue or need to escalate it.

You also should categorize alerts by levels of urgency. Not every issue will need immediate attention, but some are deal breakers that you need to address as soon as you know about them.

Severe issues should immediately notify a person, either an engineer from the team that built the service or an on-call engineer. Issues of moderate severity should generate alerts as notifications in whatever channels are deemed appropriate, so those monitoring them can pick them up. You can think of this type of alert as generating a queue of tasks to carry out as soon as possible but not immediately; they don’t need to interrupt someone’s flow or wake someone up in the middle of the night. The lowest priority alerts are those that only generate a record. These alerts aren’t strictly for human consumption, because services can receive them and take some kind of action if needed (for example, autoscaling a service when response times increase).

11.3.2 Symptoms, not causes

Symptoms, not causes, should trigger alerts. An example is a user-facing error: if users can no longer access a service, that should generate an alert. You shouldn’t be tempted to trigger alerts for every single parameter that strays outside its normal threshold. With such partial information, you won’t be able to know what’s going on or what the problem is. In figure 11.2, we illustrated the flow for placing orders in the stock market. Four services cooperate with a gateway that works as the access point for the consumer of the feature. One or more of the services may be exhibiting erroneous behavior or be overloaded. Given the mainly asynchronous nature of the communication between components, it may be hard to pinpoint why a given error is happening.

Imagine you set an alert that relates the number of requests reaching the gateway to the number of issued notifications of orders placed. It’ll be simple to correlate those two metrics over time and determine the ratio between the two. You’ll have a symptom: the number of orders placed is greater than the number completed. You can start from there and then try to understand which component is failing (maybe even multiple components). Is it the event queue or an infrastructure problem? Is the system under high load and unable to cope? The symptom is the starting point for your investigation, and from there you should follow the leads until you find the cause or causes.
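
As a sketch of how such a symptom check might look, the query below computes the ratio between completed orders and orders placed at the gateway over the last five minutes via Prometheus’s HTTP API. The metric names are illustrative, not the ones SimpleBank emits; substitute whatever your services actually expose:

import requests

query = (
    "sum(rate(orders_completed_total[5m]))"
    " / sum(rate(gateway_sell_shares_total[5m]))"
)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
result = resp.json()["data"]["result"]
ratio = float(result[0]["value"][1]) if result else 0.0
if ratio < 0.99:
    print(f"Symptom: only {ratio:.0%} of placed orders are completing")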

11.4 Observing the whole application

Correlating metrics is a valuable tool for inferring and understanding more than the per-service state of the system. Monitoring also helps you understand and reason through the behavior of the system under different conditions, and this can help you predict and adjust your capacity by using all the collected data. The good thing about collecting per-service metrics is that you can iteratively correlate them across different services and get an overall idea of the behavior of the whole application. In figure 11.16, you can see a possible correlation of different service metrics.

Let’s look into each of the suggested correlations:

  • A: Creating a new visualization comparing the rate of incoming requests to the gateway and the orders service — This allows you to understand if there are any issues in processing the incoming requests from your users. You also can use the new correlation to set an alert every time that rate drops below 99%.
  • B: Correlating the number of user requests made to the gateway with the number of order-created messages in the queue — Given that you know the order service is responsible for publishing those messages, this will, similarly to A, allow you to understand if the system is working correctly and customer requests are being processed.
  • C: Correlating the number of order-placed messages with the number of requests to the order service — This will allow you to infer if the fee service is working properly.
c11_16.png

Figure 11.16 Correlation of metrics between different services

Combining different metrics into new dashboards and setting sensible alerts on them allows you to gain insights into the overall application. It’s then up to you to determine the desired level of detail, from a high-level view to a detailed one.

So far, we’ve covered monitoring and alerting. You’ve set up a monitoring stack to be able to understand how things happened. You’re now able to understand the status of services, observe the metrics they emit, and determine if they’re operating within expected parameters. This is only part of the application observability effort. It’s a good starting point, but you do need more!

To be able to fully understand what’s going on, you need to invest some more in logging and tracing so you can have both a current view of what’s happening and a view of what happened before. In the next chapter, we’ll focus on logging and tracing as a complement to monitoring in your journey into observability. Doing so will help you to understand why things happened.

Summary

  • A robust microservice monitoring stack consists of metrics, traces, and logs.
  • Collecting rich data from your microservices will help you identify issues, investigate problems, and understand your overall application behavior.
  • When collecting metrics, you should focus on four golden signals: latency, errors, traffic (or throughput), and saturation.
  • Prometheus and StatsD are two common, language-independent tools for collecting metrics from microservices.
  • You can use Grafana to graph metric data, create human-readable dashboards, and trigger alerts.
  • Alerts based on metrics are more durable and maintainable if they indicate the symptoms of incorrect system behavior, rather than the causes.
  • Well-defined alerts should have a clear priority, be escalated to the right people, be actionable, and contain concise and useful information.
  • Collecting and aggregating data from multiple services will allow you to correlate and compare distinct metrics to gain a rich overall understanding of your system.