
13. Observability

Kasun Indrasiri1  and Prabath Siriwardena1
(1)
San Jose, CA, USA
 

Collecting data is cheap, but not having it when you need it can be expensive. In March 2016, Amazon was down for 20 minutes and the estimated revenue loss was $3.75 million. In January 2017, a system outage at Delta Airlines caused the cancellation of more than 170 flights and resulted in an estimated loss of $8.5 million. In both cases, if we had collected the right level of data, we could have predicted such behavior or recovered from it as soon as it happened by identifying the root cause. The more information we have, the better decisions we can make.

Observability is the measure of how well internal states of a system can be inferred from knowledge of its external outputs1. It is one of the most important aspects that needs to be baked into any microservices design. We need to track the throughput of each microservice, the number of successful/failed requests, the utilization of CPU, memory, and other network resources, and some business-related metrics. In this chapter, we discuss the need for observability, the role logging, metrics, and tracing play in observability, how to build a distributed tracing system with Spring Cloud Sleuth, Zipkin, and Jaeger, and how visualization, monitoring, and alerting work with Prometheus and Grafana.

Three Pillars of Observability

Observability can be achieved in three ways: logging, metrics, and tracing, which are also known as the three pillars of observability. Logging is about recording events. It can be anything. Each transaction that goes through a microservice can be logged along with related metadata, including the timestamp, the status (success/failure), the initiator, and so on. Metrics are derived by combining and measuring data from events. For example, the number of transactions processed in a unit of time and the transaction success/failure rate are metrics about your microservice. Metrics are an indication of how well (or how poorly) your service is doing. Another example is latency. The logs capture the timestamp of every request that hits a microservice and the corresponding response. The difference between those two timestamps is the latency of a given request. The average latency of a given service, which is a metric, is derived by taking all such time differences into consideration over time. Here, latency as a metric helps us decide whether our microservice is slow or fast. Also, when we want to set alerts, we always pick metrics. If we want to be alerted when the system starts to perform slower than expected, we could set up an alert on latency. When the average latency rises above a preset threshold, the system will trigger an alert. In summary, metrics help us identify trends.
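To make the idea of deriving a metric from logged events concrete, the following is a minimal sketch (it is not part of the book's samples; the class and method names are illustrative). It computes the average latency of a service from the request and response timestamps captured in its logs.

import java.util.List;

public class LatencyMetric {

    // Holds the two timestamps (in milliseconds) logged for a single request:
    // when the request hit the service and when the response left it.
    public static class RequestLog {
        final long requestTimestamp;
        final long responseTimestamp;

        public RequestLog(long requestTimestamp, long responseTimestamp) {
            this.requestTimestamp = requestTimestamp;
            this.responseTimestamp = responseTimestamp;
        }
    }

    // The latency of one request is the difference between the two timestamps;
    // averaging those differences over a time window gives us the latency metric.
    public static double averageLatencyMillis(List<RequestLog> logs) {
        return logs.stream()
                   .mapToLong(l -> l.responseTimestamp - l.requestTimestamp)
                   .average()
                   .orElse(0.0);
    }
}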

Tracing is also derived from logs. It’s a different view of how a system behaves, one that takes into consideration the ordering of events and the impact of one on another. For example, tracing helps you find the root cause of why a request that comes to the Billing microservice fails, by tracing it back to the Order Processing microservice. Tracing doesn’t need to span different systems all the time; it can be just within one system. For example, if it takes 90 milliseconds to place an order, tracing should show us where exactly the delay is and how different components contribute to it.

Distributed Tracing with Spring Cloud Sleuth

Distributed tracing helps track a given request that spans multiple microservices. Due to the nature of microservices, in most cases more than one microservice is consumed to serve a single request from a client. Figure 13-1 shows all the interactions between microservices that could happen during a single request to place an order. Some requests are direct service-to-service invocations, while others are asynchronous, based on a messaging system. Irrespective of how service-to-service communication takes place, distributed tracing provides a way to track a request across all the microservices involved in building the response to the client.

Distributed tracing adds value not just for microservices, but for any distributed system. Whenever a request passes through different components (say, an API gateway or an Enterprise Service Bus) in a network, you need the ability to track it across all those systems. That helps you identify and isolate issues related to latency, message losses, throughput, and more.
Figure 13-1. Communication between microservices

Spring Cloud Sleuth

Spring Cloud Sleuth implements a distributed tracing solution for Spring microservices2. Sleuth borrows many concepts and terminology from Dapper3, Google’s production distributed systems tracing infrastructure. The basic unit of work in Sleuth is called a span. A span represents the work carried out between two points in a communication network. For example, the Order Processing microservice (see Figure 13-1) receives an order from a client and processes the order. Then it synchronously talks to the Inventory microservice and, once it receives the response, publishes the ORDER_PROCESSING_COMPLETED event to the messaging system. Figure 13-2 shows how spans are identified between different points in the complete communication network. The span gets its initial value once the corresponding request hits the Order Processing microservice. In fact, the value of a span is a 64-bit identifier (even though we use alphabetical letters to denote spans in Figure 13-2). Any message logged inside the Order Processing microservice will carry the span ID A. The request sent from the Order Processing microservice and received by the Inventory microservice will carry the span ID B, while the work carried out inside the Inventory microservice will hold the span ID C.
Figure 13-2. Distributed tracing

Each span has a parent span. For example, the parent span of span B is span A, while the parent span of span A is null. Similarly, span A is also the parent of span E. Figure 13-3 arranges the spans by their parent-child relationships. A set of spans forming a tree-like structure is known as a trace. The value of the trace remains the same throughout all the spans of a given request. As per Figure 13-2, the value of the trace is A, and it carries the same value across all the spans. The trace ID helps correlate messages between microservices. Once all the logs from different microservices are published into a centralized tracing system, given the trace ID, we can trace a message across the different systems.
Figure 13-3. A set of spans forming a tree-like structure
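To make the parent-child relationship concrete, here is a simplified sketch of the data each span carries. This is only an illustration (not Sleuth’s actual classes): every span holds the trace ID of its trace, its own span ID, and the ID of its parent span, which is null for the root span.

public class Span {

    private final String traceId;      // same value for every span of a given request
    private final String spanId;       // unique identifier of this span (a 64-bit value in Sleuth)
    private final String parentSpanId; // null for the root span (span A in Figure 13-3)

    public Span(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    // e.g., new Span("A", "B", "A") models span B from Figure 13-3, whose parent is span A
    public boolean isRoot() {
        return parentSpanId == null;
    }
}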

Let’s get our hands wet! Let’s see how to use Spring Cloud Sleuth to do distributed tracing with a set of example Spring microservices.

Note

To run the examples in this chapter, you need Java 8 or later, Maven 3.2 or later, and a Git client. Once you have successfully installed those tools, you need to clone the Git repo: https://github.com/microservices-for-enterprise/samples.git . The chapter’s examples are in the ch13 directory.

:> git clone https://github.com/microservices-for-enterprise/samples.git

Engaging Spring Cloud Sleuth with Spring Boot Microservices

Engaging Sleuth with Spring Boot is quite straightforward. Once you download all the examples from the Git repository, you can find the source code related to this example available in the ch13/sample01 directory.

Note

A comprehensive introduction to Spring Cloud Sleuth is out of the scope of this book, and we recommend readers looking for details to refer to Sleuth documentation available at https://cloud.spring.io/spring-cloud-sleuth/single/spring-cloud-sleuth.html .

Let’s look at some of the notable Maven dependencies added to the ch13/sample01/pom.xml file. The spring-cloud-starter-sleuth dependency brings in all the dependencies related to Sleuth. Once it is engaged with a Maven project, Sleuth spans and traces are automatically added to all the logs. Sleuth intercepts all the HTTP requests coming into a given microservice and inspects them to see whether any tracing information is already available. If so, it extracts that information and makes it available to the corresponding microservice. Sleuth also injects the tracing information into the Spring Mapped Diagnostic Context (MDC), so that logs created by the microservice automatically include the tracing data. When an HTTP request goes out from the microservice, Sleuth once again injects the tracing information into the outbound request or the response.
<dependency>
       <groupId>org.springframework.cloud</groupId>
       <artifactId>spring-cloud-starter-sleuth</artifactId>
       <version>2.0.0.RC1</version>
</dependency>
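Because Sleuth populates the MDC, you can also read the identifiers programmatically, for example to return a correlation ID to the caller. The following is a minimal sketch; the MDC key names used here (traceId and spanId) are an assumption based on Sleuth 2.x behavior, so verify them against the Sleuth version you run.

import org.slf4j.MDC;

public class TraceInfo {

    // Reads the trace ID Sleuth placed in the Mapped Diagnostic Context.
    // Returns null if Sleuth has not populated the MDC for the current thread.
    public static String currentTraceId() {
        return MDC.get("traceId"); // key name assumed for Sleuth 2.x
    }

    public static String currentSpanId() {
        return MDC.get("spanId"); // key name assumed for Sleuth 2.x
    }
}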
Let’s look at the source code (ch13/sample01/src/main/java/com/apress/ch13/sample01/service/OrderProcessing.java), which logs data related to the order retrieval requests. The logging API used here to log messages has nothing related to Sleuth—it’s simply the slf4j4 API.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// A typical declaration for the logger referenced below.
private static final Logger logger = LoggerFactory.getLogger(OrderProcessing.class);

@RequestMapping(value = "/{id}", method = RequestMethod.GET)
public ResponseEntity<?> getOrder(@PathVariable("id") String orderId) {
 logger.info("retrieving order:" + orderId);
 Item book1 = new Item("101", 1);
 Item book2 = new Item("103", 5);
 PaymentMethod myvisa = new PaymentMethod("VISA", "01/22", "John Doe",
                                          "201, 1st Street, San Jose, CA");
 Order order = new Order("101021", orderId, myvisa, new Item[] { book1,
                         book2 },"201, 1st Street, San Jose, CA");
 return ResponseEntity.ok(order);
}
Before we run the code, there are a couple of important properties to configure in the ch13/sample01/src/main/resources/application.properties file.
spring.application.name=sample01
spring.sleuth.sampler.percentage=0.1
The value of the spring.application.name property is added to the logs as the service name, along with the trace ID and span ID. Once Sleuth is engaged, the service name, trace ID, span ID, and a flag indicating whether the logs are published to Zipkin are added to each log entry. In the following example, sample01 is the service name (picked from the property file), d25a633196c01c19 (the first one) is the trace ID, d25a633196c01c19 (the second one) is the span ID, and false indicates that this log is not published to Zipkin. We look into Zipkin later in the chapter, so for the time being, think of it as a server capturing all the tracing information in a microservices deployment.
INFO [sample01,d25a633196c01c19,d25a633196c01c19,false] 27437 --- [nio-9000-exec-2] c.a.c.sample01.service.OrderProcessing: retrieving order:11

The spring.sleuth.sampler.percentage property in the application.properties file indicates what percentage of the requests must be traced. By default it is set to 0.1, which means that only 10% of all requests will be published to Zipkin. Setting it to 1.0 publishes all requests.

Now, let’s see how to run the Order Processing microservice and invoke it with the following cURL command (run this from the ch13/sample01 directory).
> mvn clean install
> mvn spring-boot:run
> curl http://localhost:9000/order/11
This cURL command will print the order details, and if you look at the command console that runs the Order Processing microservice, you will find the following log, which includes the trace ID and the span ID along with other metadata.
INFO [sample01,d25a633196c01c19,d25a633196c01c19,false] 27437 --- [nio-9000-exec-2] c.a.c.sample01.service.OrderProcessing: retrieving order:11

We would not be surprised if you do not find this example, on its own, that helpful. It does nothing more than logging; there is no tracing at all. The next section will clear your doubts and probably convince you of the value of tracing.

Tracing Messages Between Multiple Microservices with Spring Cloud Sleuth

Let’s extend the example we discussed so far with multiple microservices and see how a given request is traced throughout the communication network. If you have already started the Order Processing microservice (sample01) as per the instructions in the previous section, keep it running. In addition to that you also need to start the Inventory microservice. Let’s spin up the Inventory microservice by running the following command from the ch13/sample02 directory.
> mvn clean install
> mvn spring-boot:run
Now when we run our cURL client to place an order with the Order Processing microservice, it will talk to the Inventory microservice to update the inventory (see Figure 13-1).
> curl -v  -k  -H "Content-Type: application/json" -d '{"customer_id":"101021","payment_method":{"card_type":"VISA","expiration":"01/22","name":"John Doe","billing_address":"201, 1st Street, San Jose, CA"},"items":[{"code":"101","qty":1},{"code":"103","qty":5}],"shipping_address":"201, 1st Street, San Jose, CA"}' http://localhost:9000/order
Let’s look at the command console, which runs the Order Processing microservice. It should print the following log with the tracing information.
INFO [sample01,76f19c035e8e1ddb,76f19c035e8e1ddb,false] 29786 --- [nio-9000-exec-1] c.a.c.sample01.service.OrderProcessing   : creating order :10dcc849-3d8d-49fb-ac58-bc5da29db003
The command console, which runs the Inventory microservice, prints the following log.
INFO [sample04,76f19c035e8e1ddb,be46d1595ef606a0,false] 29802 --- [io-10000-exec-1] c.a.ch13.sample02.service.Inventory      : item code 101
INFO [sample04,76f19c035e8e1ddb,be46d1595ef606a0,false] 29802 --- [io-10000-exec-1] c.a.ch13.sample02.service.Inventory      : item code 103

In both logs printed from the two microservices, the trace ID (76f19c035e8e1ddb) is the same, while each has its own span ID (76f19c035e8e1ddb and be46d1595ef606a0). In the next section, we see how to publish these logs to Zipkin and visualize the complete path of a request across multiple microservices.

Data Visualization and Correlation with Zipkin

Zipkin5 is a distributed tracing system that helps visualize and correlate the communication paths between microservices. It also helps in diagnosing latency issues by gathering timing data. All the microservices can be instrumented to publish their logs, along with tracing information, to Zipkin (see Figure 13-4).

Note

A comprehensive introduction to Zipkin is out of the scope of this book, and we recommend readers looking for details to refer to the Zipkin documentation available at https://zipkin.io/ .

Setting up Zipkin is quite straightforward with Docker. In Chapter 8, “Deploying and Running Microservices,” we discussed Docker in detail, and assuming you have Docker up and running on your machine, you can spin up a Docker container with Zipkin using the following command. In case you want to try Zipkin without Docker, refer to the installation guide available at https://zipkin.io/pages/quickstart.html .
> docker run -d -p 9411:9411 openzipkin/zipkin
This command binds port 9411 of the host machine to port 9411 on the Docker container. Once the Zipkin node starts, you can access its web-based console from the host machine via http://localhost:9411/zipkin/, or simply http://localhost:9411 (see Figure 13-5). The next step is to update the configuration of the Order Processing and Inventory microservices (from the previous section, sample01 and sample02) to publish logs to Zipkin. If you have both microservices running, first stop them and update the application.properties file with the following. You need to do this for both microservices. The spring.zipkin.baseUrl property carries the URL of the Zipkin server.
spring.zipkin.baseUrl=http://localhost:9411/
In addition to setting this property in the application.properties file, we also must add the following dependency to the pom.xml file of both microservices to complete the Zipkin integration. The spring-cloud-sleuth-zipkin dependency takes care of publishing logs to the Zipkin server, in a format understood by Zipkin.
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
    <version>2.0.0.RC1</version>
</dependency>
Start the Order Processing and Inventory microservices and use the following cURL command to place an order with the Order Processing microservice. Do it a few times, so we gather enough logs at Zipkin. Also, keep in mind that not all logs are sent to Zipkin; it depends on the value you set for the spring.sleuth.sampler.percentage property in the application.properties file. In a typical microservices deployment the request volume can be very high, and hence so can the volume of traced data. Based on the volume of requests you get and the business criticality of the operations performed by the microservices, you can decide what percentage of requests needs to be sampled and sent to Zipkin.
 > curl -v  -k  -H "Content-Type: application/json" -d '{"customer_id":"101021","payment_method":{"card_type":"VISA","expiration":"01/22","name":"John Doe","billing_address":"201, 1st Street, San Jose, CA"},"items":[{"code":"101","qty":1},{"code":"103","qty":5}],"shipping_address":"201, 1st Street, San Jose, CA"}' http://localhost:9000/order
Figure 13-4. Each microservice is instrumented to publish logs to Zipkin

For the traced data published from each microservice to Zipkin, you’ll find the following logs printed on the command console that runs each microservice. Notice the fourth parameter of the tracing information added to each log, which indicates whether the logs are published to Zipkin; here it is set to true.
INFO [sample01,bf581ac0009c6e48,bf581ac0009c6e48,true] 30166 --- [nio-9000-exec-1] c.a.c.sample01.service.OrderProcessing : retrieving order:11
INFO [sample02,1a35024149ac7711,98239453fa5582ba,true] 30153 --- [io-10000-exec-7] c.a.ch04.sample04.service.Inventory : item code 101
INFO [sample02,1a35024149ac7711,98239453fa5582ba,true] 30153 --- [io-10000-exec-7] c.a.ch04.sample04.service.Inventory : item code 103

What happens underneath is that Sleuth instruments your application to generate tracing information in a format understood by Zipkin. Zipkin is the data collector, and once all the microservices publish traced data to it, Zipkin helps you do the distributed tracing. To do distributed tracing, you need both parts.

Note

Jaeger6 is another open source distributed tracing system inspired by Zipkin and Dapper and developed by Uber.

Figure 13-5. Zipkin web-based console

Now let’s see how to find some useful information from the Zipkin server with the published tracing information. On the home page of the Zipkin web console, in the Service Name dropdown box, you will notice that we have two names: sample01 and sample02. Those are the service names associated with our two microservices, and by picking a service name there, you can find all the tracing information related to that microservice. Figure 13-6 shows the tracing information related to sample01, the Order Processing microservice.

Note

The Zipkin architecture is built with four main components: collector, storage, search, and web UI. The traced data published from applications (or microservices) first hits the collector. The collector validates, stores, and indexes the data for lookups. The storage of Zipkin is pluggable and it natively supports Cassandra, ElasticSearch, and MySQL. Once the data is indexed and stored, the search component of Zipkin provides a JSON API to interact with traces, which is mostly used by the web UI.

Figure 13-6. Tracing information related to the Order Processing microservice

Zipkin also has another nice feature, which builds a dependency graph for your microservices by analyzing inbound and outbound traffic patterns. This is quite useful when we have many microservices in our deployment. In this particular example it’s a very simple graph between the Order Processing (sample01) and Inventory (sample02) microservices, as shown in Figure 13-7.
Figure 13-7. Dependency graph between the OrderProcessing and Inventory microservices

Event-Driven Log Aggregation Architecture

Let’s revisit Figure 13-4, which is the high-level design of what we are going to discuss in this section. The design here is a bit different from what we proposed in Chapter 2, “Designing Microservices” (see Figure 2-17). Figure 13-8 depicts the redone design as per the recommendation for a log aggregator architecture in Chapter 2.
Figure 13-8. Zipkin with event-driven log aggregation architecture

Unlike in Figure 13-4, here there is no direct coupling between the Zipkin server and the microservices. Each microservice publishes its logs to a messaging system, which can be either RabbitMQ or Kafka, and Zipkin picks the logs up from the messaging system. The advantage of this model is that even if the Zipkin server is down for some time, the microservices can independently keep publishing logs, and Zipkin picks them all up from the messaging system when it comes back up.
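As a rough sketch of how this could look with Sleuth and RabbitMQ (this example is not from the book’s samples: it assumes Spring Cloud Sleuth’s RabbitMQ sender support, the spring-rabbit dependency on the classpath alongside spring-cloud-sleuth-zipkin, and a broker on localhost; verify the property names against your Sleuth version), each microservice would switch the sender type instead of pointing spring.zipkin.baseUrl at Zipkin directly.

spring.zipkin.sender.type=rabbit
spring.rabbitmq.host=localhost
spring.rabbitmq.port=5672

Zipkin is then started with its RabbitMQ collector enabled, so it consumes the spans from the broker rather than receiving them over HTTP.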

Introduction to Open Tracing

Distributed tracing becomes quite tricky when different microservices in the same deployment use different tracing modules. For example, the Order Processing microservice may use Sleuth to generate spans and traces, while the Inventory microservice uses another module. Both can publish the traced data to Zipkin, but unless both modules share the same definition of spans and traces, and respect the traced data generated by each other, this information will be useless. This forces developers to use the same tracing module in all the microservices, which is a violation of the polyglot architecture we proudly talk about with respect to microservices. Open tracing7 is an initiative to address this concern by building an open standard. It precisely defines what a span is and what a trace is under the open tracing data model.

Open tracing has to work across multiple programming languages. At the time of this writing, it defines language-level APIs for nine programming languages: Go, Python, JavaScript, Java, C#, Objective-C, C++, Ruby, and PHP. There are several implementations of these APIs already. Jaeger, the open source distributed tracing system developed by Uber, has support for open tracing and includes open tracing client libraries for several programming languages: Java, Go, Python, Node.js, C++, and C#.

Distributed Tracing with Open Tracing Using Spring Boot Microservices and Zipkin

Zipkin has support for open tracing, but Sleuth does not. You can think of Sleuth as the tracing client, while Zipkin is the server that collects all the traced data. In the examples we discussed in the previous sections, Sleuth was used as a tracing client for Zipkin, and it used a Zipkin-specific format, which won’t work with open tracing. In this section, we see how to publish traced data from a Spring Boot microservice in a way that is compatible with open tracing. The source code related to this example is available in the ch13/sample03 directory.

Let’s look at some of the notable Maven dependencies added to the ch13/sample03/pom.xml file. The opentracing-spring-cloud-starter and opentracing-spring-zipkin-starter dependencies bring in all the dependencies required to publish open tracing compatible tracing information to Zipkin.
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-cloud-starter</artifactId>
    <version>0.1.13</version>
</dependency>
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-zipkin-starter</artifactId>
    <version>0.1.1</version>
</dependency>
Assuming you are still running the Zipkin node from the previous section, add the opentracing.zipkin.http-sender.baseUrl property, which carries the URL of the Zipkin server, to the application.properties file.
opentracing.zipkin.http-sender.baseUrl=http://localhost:9411/
Now, let’s run the Order Processing microservice and invoke it with the following cURL command (run from the ch13/sample03 directory). Do it a few times and observe the data recorded at Zipkin via its web console running at http://localhost:9411/zipkin/.
> mvn clean install
> mvn spring-boot:run
> curl http://localhost:9000/order/11

Distributed Tracing with Open Tracing Using Spring Boot Microservices and Jaeger

In the previous section we explained how to publish open tracing compatible traces to Zipkin. Since it’s open tracing, not just Zipkin but any other product supporting open tracing should accept it. In this section, we see how to publish traced data from a Spring Boot microservice to Jaeger, which is compatible with open tracing (see Figure 13-9). Jaeger is another open source distributed tracing system inspired by Zipkin and Dapper and developed by Uber. The source code related to this example is available in the ch13/sample04 directory. We can spin up a Jaeger Docker instance with the following command, which starts it on HTTP port 16686 and UDP port 5775. After it starts, we can access its web console via http://localhost:16686/.
> docker run -d -p 5775:5775/udp -p 16686:16686 jaegertracing/all-in-one:latest
Figure 13-9. Jaeger web console

Note

A comprehensive introduction to Jaeger is out of the scope of this book, and we recommend readers looking for details to refer to the Jaeger documentation available at https://www.jaegertracing.io/docs/ .

Let’s look at some of the notable Maven dependencies added to the ch13/sample04/pom.xml file. The opentracing-spring-cloud-starter and opentracing-spring-cloud-starter-jaeger dependencies bring all the dependencies required to publish open tracing-compatible tracing information to Jaeger.
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-cloud-starter</artifactId>
    <version>0.1.13</version>
</dependency>
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-cloud-starter-jaeger</artifactId>
    <version>0.1.13</version>
</dependency>
From our Spring Boot microservice we use UDP port 5775 to connect to the Jaeger server to publish traces. We need to add the opentracing.jaeger.udp-sender.host and opentracing.jaeger.udp-sender.port properties to the application.properties file.
opentracing.jaeger.udp-sender.host=localhost
opentracing.jaeger.udp-sender.port=5775
Now, let’s run the Order Processing microservice and invoke it with the following cURL command (run from the ch13/sample04 directory). Do it a few times and observe the data recorded under Jaeger via its web console running at http://localhost:16686/.
> mvn clean install
> mvn spring-boot:run
> curl http://localhost:9000/order/11

Metrics with Prometheus

Prometheus is an open source system for monitoring and alerting. In this section, we see how to use Prometheus to monitor a microservices deployment. The way it works is that each of your microservices exposes its own endpoint carrying its metrics, and Prometheus periodically polls those endpoints (see Figure 13-10).

Note

A comprehensive introduction to Prometheus is out of the scope of this book, and we recommend readers looking for details to refer to the Prometheus documentation available at https://prometheus.io/ .

Figure 13-10. Prometheus pulls traced data from the connected microservices

Exposing Metrics from the Spring Boot Microservice

First, let’s see how to instrument our Spring Boot microservice to expose metrics in a format understood by Prometheus. The source code related to this example is available in the ch13/sample05 directory. Let’s look at some of the notable Maven dependencies added to the ch13/sample05/pom.xml file. The simpleclient_spring_boot and simpleclient_hotspot dependencies bring in all the dependencies required to expose metrics to Prometheus. The simpleclient_spring_boot dependency introduces two class-level annotations, @EnablePrometheusEndpoint and @EnableSpringBootMetricsCollector, which are added to the ch13/sample05/OrderProcessingApp.java class file.
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_spring_boot</artifactId>
    <version>0.1.0</version>
</dependency>
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.1.0</version>
</dependency>
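The application class then looks roughly like the following. This is a minimal sketch of what ch13/sample05/OrderProcessingApp.java would contain; the package declaration is omitted and the exact content of the book’s sample may differ.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;

@SpringBootApplication
@EnablePrometheusEndpoint             // exposes the /prometheus scrape endpoint
@EnableSpringBootMetricsCollector     // exports the Spring Boot metrics in Prometheus format
public class OrderProcessingApp {

    public static void main(String[] args) {
        SpringApplication.run(OrderProcessingApp.class, args);
    }
}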
To expose the metrics from the Spring Boot application with no security (we can probably use network level security), add the following property to the ch13/sample05/src/main/resources/application.properties file.
management.security.enabled=false
Now we’re all set. Let’s spin up our Order Processing microservice with the following command (from the sample05 directory).
> mvn clean install
> mvn spring-boot:run
Once the service is up, the metrics published by the service are accessible via http://localhost:9000/prometheus. Here, 9000 is the port where your microservice is running. The following text lists the truncated output from that endpoint.
# HELP httpsessions_max httpsessions_max
# TYPE httpsessions_max gauge
httpsessions_max -1.0
# HELP httpsessions_active httpsessions_active
# TYPE httpsessions_active gauge
httpsessions_active 0.0
# HELP mem mem
# TYPE mem gauge
mem 549365.0
# HELP mem_free mem_free
# TYPE mem_free gauge
mem_free 211808.0
# HELP processors processors
# TYPE processors gauge
processors 8.0
# HELP instance_uptime instance_uptime
# TYPE instance_uptime gauge
instance_uptime 313310.0
# HELP uptime uptime
# TYPE uptime gauge
uptime 317439.0
# HELP systemload_average systemload_average
# TYPE systemload_average gauge
systemload_average 2.13720703125
# HELP heap_committed heap_committed
# TYPE heap_committed gauge
heap_committed 481280.0
# HELP heap_init heap_init
# TYPE heap_init gauge
heap_init 262144.0
# HELP heap_used heap_used
# TYPE heap_used gauge
heap_used 269471.0
# HELP heap heap
# TYPE heap gauge
heap 3728384.0
# HELP nonheap_committed nonheap_committed
# TYPE nonheap_committed gauge
nonheap_committed 71696.0
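In addition to these built-in JVM and HTTP metrics, you can record your own business metrics with the Prometheus simpleclient API, and they should show up in the same /prometheus output. The following is a minimal sketch; the metric name and the place where it is incremented are illustrative, not part of the book’s sample.

import io.prometheus.client.Counter;

public class OrderMetrics {

    // A monotonically increasing counter, registered in the default registry
    // that the /prometheus endpoint exposes.
    private static final Counter ordersProcessed = Counter.build()
            .name("orders_processed_total")
            .help("Total number of orders processed.")
            .register();

    // Call this from the code path that completes an order.
    public static void recordOrder() {
        ordersProcessed.inc();
    }
}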

Setting Up Prometheus

Setting up Prometheus is quite straightforward with Docker. First we need to create a prometheus.yml file, which includes all the services Prometheus will monitor. The following example shows a sample file, which includes our Order Processing microservice (sample05) and the Prometheus instance itself. A given prometheus.yml file can have multiple jobs. Here, the prometheus job takes care of monitoring Prometheus itself (which runs on port 9090), while the orderprocessing job is set up to poll the 10.0.0.93:9000 endpoint every 10 seconds. Keep in mind that here we need to use the IP address of the node that runs the Order Processing microservice, and it has to be accessible from the Docker instance that runs Prometheus.
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'orderprocessing'
    scrape_interval: 10s
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['10.0.0.93:9000']
Now let’s spin up the Prometheus Docker instance with the prometheus.yml file.
:> docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
Once the Prometheus node is up, go to the URL http://localhost:9090/targets, and it will show all the services Prometheus monitors and the status of the individual endpoints (see Figure 13-11).
Figure 13-11. Prometheus targets and the status of individual endpoints

Building Graphs with Prometheus

Now we have all the metrics from our Spring Boot microservice published to Prometheus. Let’s see how to build graphs to monitor those published stats.

First, go to http://localhost:9090/graph and pick the metric you want to monitor. For example, we picked heap_used. Then, when you click on the Graph tab, you should be able to see the used heap graphed against time (see Figure 13-12). That way, you can add any number of graphs.
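The expression box on the graph page also accepts PromQL, the Prometheus query language, so instead of plotting a raw gauge you can plot a derived series. For example, the following expression (illustrative, using the heap_used gauge from the earlier output) plots heap usage averaged over a five-minute window, which smooths out short spikes.

avg_over_time(heap_used[5m])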

Note

Prometheus was born as an open source project released under the Apache 2.0 license at SoundCloud in 2012. It’s mostly written in Go, and the community around Prometheus has grown over the last few years. In 2016 it became the second project to join the Cloud Native Computing Foundation (CNCF).

Figure 13-12. Using Prometheus to monitor the used heap of a Spring Boot microservice

Analytics and Monitoring with Grafana

Grafana is an open source product for analytics and monitoring, and it is more powerful than Prometheus for building dashboards. In fact, it is recommended to use Grafana to build dashboards with Prometheus. Prometheus had its own dashboarding tool called Promdash, but with the advancements in Grafana, the Prometheus developers let it go and started promoting Grafana instead.

Note

A comprehensive introduction to Grafana is out of the scope of this book, and we recommend readers looking for details to refer to the Grafana documentation available at http://docs.grafana.org/ .

Building a Dashboard with Grafana

Setting up Grafana is quite straightforward with Docker. Use the following command to spin up a Grafana Docker instance, which runs on HTTP port 3000.
:> docker run -d -p 3000:3000 grafana/grafana
Once the server is started, you can log in to the Grafana management console via http://localhost:3000 with the credentials admin/admin. The first thing we need to do is introduce a new data source; Grafana uses data sources to build graphs. Click on Add Data Source and pick Prometheus as the data source type. Give it a name, say prometheus_ds. The rest we can keep as is, except for the HTTP URL property. This URL should point to the Prometheus server, which must be accessible from the node that runs Grafana. The data source configuration looks like Figure 13-13.
Figure 13-13. Grafana data source properties

Note

As per the configuration shown in Figure 13-13, we expect the Prometheus endpoint to be open or not protected. If it is protected, Grafana supports multiple security models, including basic authentication and TLS client authentication.

Now we can create a new dashboard via http://localhost:3000/dashboard/new. Choose Graph, then click on the Panel Title and click Edit. Under Metrics, you can choose any metric made available via the selected data source. As shown in Figure 13-14, we can define multiple queries. Here we have picked heap and heap_used. A query defines which information we want to pick from the metrics pulled from the Prometheus endpoint, and the graphs are rendered accordingly.
Figure 13-14. Setting up queries to build a dashboard with Grafana

Once we complete this process, we can find the dashboard we just created listed on the Grafana home page under Recently viewed dashboards. You can click on it to view the dashboard, which will appear as shown in Figure 13-15.
Figure 13-15. Grafana dashboard to monitor the microservices deployment

Creating Alerts with Grafana

Grafana lets you create alerts and associate them with a graph (or a dashboard panel). To create an alert corresponding to the dashboard we created in the previous section, first we need to edit the graph (see Figure 13-16) by clicking on its title and choosing Edit. Under Alerts, we can create a rule, which states under which conditions the system should raise an alert (see Figure 13-17). The only type of condition that Grafana supports at the moment is a query, and for an alert rule you can add multiple queries linked to each other (with an AND or an OR).

Note

Grafana alerting support is limited to the following data sources: Graphite, Prometheus, ElasticSearch, InfluxDB, OpenTSDB, MySQL, Postgres, and Cloudwatch.

Figure 13-16. The Grafana dashboard monitors the microservices deployment

The following query indicates that when the maximum value of metric A (see Figure 13-14, in our case heap) over the last five minutes is above 3, an alert should be raised. This rule is evaluated every 60 seconds.
WHEN max() OF query(A, 5m, now) IS ABOVE 3
In addition to the max() function, Grafana also supports avg(), min(), sum(), last(), count(), median(), diff(), percent_diff(), and count_non_null(). See Figure 13-17.
Figure 13-17. Configuring alerts

Once the alert rules are set, we can configure whom to send notifications to, along with a message, under the Notifications menu, when an alert is raised (see Figure 13-18). Grafana supports multiple notification channels, including Email, PagerDuty, Telegram, Slack, and many more.
Figure 13-18. Configuring notifications

Using Fluentd Log Collector with Docker

Fluentd is an extensible data collection tool that runs as a daemon. Microservices can publish logs to Fluentd. It has a rich set of plugins that can read logs in different formats from different sources and parse the data. Also, it can format, aggregate, and publish logs to third-party systems like Splunk, Prometheus, MongoDB, PostgreSQL, AWS S3, Kafka, and many more (see Figure 13-19). The beauty of the Fluentd architecture is that it decouples data sources from target systems. With no changes to your microservice, Fluentd can change the target system of your logs or add new target systems. It can also do content filtering on log messages and, based on certain criteria, decide which systems to publish the logs to.
Figure 13-19. Multiple input sources and target systems with Fluentd

In the following sections, we see how to publish logs to Fluentd from a microservice running on a Docker container. First we’ll set up Fluentd and then see how to spin up a microservice and publish logs to Fluentd.

Starting Fluentd as a Docker Container

The easiest and most straightforward way to spin up Fluentd is using a Docker container. In practice, this is the most common approach too. In a production setup, where all your microservices are running in a Kubernetes environment (which we discussed in Chapter 8), the Fluentd node (which acts as a daemon) runs in the same pod as the corresponding microservice. In fact, you can treat the container running Fluentd as a sidecar to the microservice (see Figure 13-20). The microservice, by default, publishes its logs to localhost:24224, which is the port Fluentd listens on over TCP.
Figure 13-20. Microservice and Fluentd containers are running in the same pod

Let’s use the following command to spin up the Fluentd Docker container. Before that, make sure you have a directory called data in the home directory of the host filesystem (or you can use your own directory instead of ~/data).
:> docker run -d  -p 24224:24224 -v ~/data:/fluentd/log  fluent/fluentd

As we learned in Chapter 8, containers are immutable. In other words, when a container goes down, it does not save any of the changes made to its filesystem while running. By default, all the logs published to Fluentd by the Order Processing microservice are stored in the /fluentd/log directory in the container filesystem. To persist that data permanently, we need to use a Docker volume. Using the -v option in the previous command, we map the ~/data directory in the host filesystem to the /fluentd/log directory in the container filesystem. Even when the container goes down, we should be able to find the log files in the ~/data directory. The -p option in the previous command maps port 24224 of the Docker container (the port Fluentd listens on by default) to port 24224 of the host machine. fluent/fluentd is the name of the container image, which will be pulled from Docker Hub.

Once we get the Fluentd up and running, we can start a Docker container with the Order Processing microservice.

Publishing Logs to Fluentd from a Microservice Running in a Docker Container

Here we are going to use the same Order Processing microservice we discussed throughout the book, but instead of building it from the source code, we are going to pull it from Docker Hub. The following command will spin up a Docker container (having the image name prabath/sample01) with the Order Processing microservice.
:> docker run -d -p 9000:9000 --log-driver=fluentd prabath/sample01

In this command, we use the log-driver argument with the value fluentd8. Docker uses this driver to publish logs from stdout9 (by default) to the Fluentd daemon (or the container that runs Fluentd). The microservices developer does not need to make any changes here or do anything specific to Fluentd. By default, the fluentd log driver connects to localhost:24224 over TCP. If we run Fluentd on a different port, we need to pass the fluentd-address argument to the docker run command, with a value pointing to the Fluentd container (e.g., fluentd-address=localhost:28444).
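For example, a hypothetical invocation that points the driver at a non-default Fluentd address and tags the logs (the tag value is our own choice; it is what the docker.** match pattern in the Fluentd configuration shown later keys on) would look like the following.

:> docker run -d -p 9000:9000 --log-driver=fluentd --log-opt fluentd-address=localhost:28444 --log-opt tag=docker.sample01 prabath/sample01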

If it all works fine, once the Order Processing service is started, we should see some logs available in the ~/data directory in the host filesystem. This is the directory we used to create a Docker volume before.

Note

A comprehensive introduction to Fluentd is out of the scope of this book. We recommend readers looking for details to refer to the Fluentd documentation available at https://docs.fluentd.org/ .

How It Works

Fluentd uses a config file, which defines the input sources and the output targets. By default it is in the /fluentd/etc directory of the container filesystem where Fluentd is running, and the file is called fluent.conf. Let’s look at the default content of fluent.conf. The source tag defines where the data comes from. Under that we have forward and port elements, which make Fluentd accept messages on port 24224 over TCP. The responsibility of the source tag is to accept messages and hand them over to the Fluentd routing engine as events. Each event has three elements: tag, time, and record. The sender (in our case the fluentd driver) defines the value of the tag.

Note

You can find all the configurations related to the fluentd Docker image from here: https://hub.docker.com/r/fluent/fluentd/ . It also explains how to override the default Fluentd configuration file, which ships with the image.

<source>
  @type  forward
  @id    input1
  @label @mainstream
  port  24224
</source>
<filter **>
  @type stdout
</filter>
<label @mainstream>
  <match docker.**>
    @type file
    @id   output_docker1
    path         /fluentd/log/docker.*.log
    symlink_path /fluentd/log/docker.log
    append       true
    time_slice_format %Y%m%d
    time_slice_wait   1m
    time_format       %Y%m%dT%H%M%S%z
  </match>
<match **>
    @type file
    @id   output1
    path         /fluentd/log/data.*.log
    symlink_path /fluentd/log/data.log
    append       true
    time_slice_format %Y%m%d
    time_slice_wait   10m
    time_format       %Y%m%dT%H%M%S%z
  </match>
</label>

The match element tells Fluentd what to do with matching messages. It matches the value of the tag element in each event against the criteria defined in the match element. In our case, it checks whether the tag starts with the word docker. The most common use case of the match element is to define output targets. In the previous configuration, the output is written to a file in the /fluentd/log directory (in the container). If we want to send the output to other systems, we can use any of the available Fluentd plugins10. Finally, the label element defined under the source element acts as a reference. For example, when the match element gets executed, if there is a label defined under the source, then only the match elements under the corresponding label get executed. The objective of the label element is to reduce configuration file complexity.

Note

Logstash11, which is part of the well-known ELK (ElasticSearch, Logstash, Kibana) stack, provides similar functionality to Fluentd. Before you pick a logging solution for your enterprise, we recommend evaluating the pros and cons of both Fluentd and Logstash.

Using Fluentd in a Microservices Deployment

In this section, we see how to extend this Fluentd example architecturally to fit a production microservices deployment. As we discussed (and as shown in Figure 13-21), each Kubernetes pod (which we discussed in Chapter 8) has an instance of Fluentd running in a container. This container can be treated as a sidecar. Each microservice in our deployment has a similar setup. Each Fluentd node can filter out the logs it needs and publish them to another Fluentd node, which does the log aggregation. This aggregator Fluentd node can also decide which other target systems it has to publish logs to; a rough sketch of the forwarding configuration follows Figure 13-21.
Figure 13-21. Fluentd in a production deployment
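As a rough sketch of that forwarding step (the aggregator hostname below is hypothetical, and this replaces the file output shown in the earlier fluent.conf), the match section of the sidecar’s configuration would use Fluentd’s forward output plugin.

<match docker.**>
  @type forward
  # Forward matching events to the aggregating Fluentd node.
  # The hostname below is hypothetical; point it at your aggregator.
  <server>
    host fluentd-aggregator
    port 24224
  </server>
</match>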

Summary

In this chapter, we discussed one of the key aspects of microservices architecture, observability, and the three pillars behind it: logging, metrics, and tracing. We also discussed distributed tracing, which is in fact the most important enabler of observability. Distributed tracing helps trace a request that spans multiple systems. We used Spring Cloud Sleuth, Zipkin, and Jaeger to build a distributed tracing system, and used Prometheus and Grafana for visualization, monitoring, and alerting. Finally, we discussed how to use Fluentd, an extensible log-collection tool, in a containerized deployment.
