This chapter covers
In the previous chapter, we focused on emitting metrics from your services and using those metrics to create dashboards and alerts. Metrics and alerts are only one part of what you need to achieve observability in your microservice architecture. In this chapter, we’ll focus on collecting logs and making sure you’re able to trace the interactions between services. This will allow you to not only have an overview of how the system behaves but also go back in time and retrospectively follow each request. Doing so is important to debug errors and to identify bottlenecks. Logs give you a sort of paper trail that documents the history of each request entering your system, whereas traces provide you a way to establish a timeline for each request, to understand how much time it spent in different services.
By the end of this chapter, you’ll have created a basic logging infrastructure and set up the tracing capability. You’ll be able to both monitor the operation of your application and have the tools to audit and investigate in case you need to do so for particular requests. In addition, you’ll be able to identify performance issues by looking into tracing data.
In a microservices-based architecture, multiple services will be involved in providing functionality to users. It gets hard to understand what goes on with every request when you no longer have a central access point to data. Services are distributed across multiple nodes, are ephemeral, and are continuously being deployed and scaled to meet the needs of operation. Let’s revisit the sell order use case as you might have implemented it if you’d needed a single application running on a single machine (figure 12.1).
In figure 12.1, we’ve represented each of the services that collaborate to allow a client to sell shares as modules in the same application. If you were to inspect a given request lifecycle in the system, you could log in to the machine and inspect log data stored on a hard drive. But you’d most likely have multiple machines running the application, for redundancy and availability, so things wouldn’t be as easy as logging in to one machine. Once you identified the request you were interested in observing, you’d have to identify which machine had served the request and then inspect it. Going through the logs from that machine would provide the insights you need.
Maintaining logs in a single machine is by no means easy — a server can also crash and become unavailable. Our aim here isn’t to talk about minimizing the complexity of keeping log data (or any data) safely persisted but to point out that having a single point for storing all the data makes it easier and more convenient to consult.
Let’s now compare the same scenario in a microservices application. Figure 12.2 illustrates the same use case with multiple services, each with multiple copies of itself, running independently.
As you can see in figure 12.2, you have five services running independently, and each of those services has three instances running. This means potentially none of the pods are executing on the same physical machine. A request coming into the system will most likely flow through multiple pods running on different physical machines, and you have no easy way to track down that request by accessing logs. Could you even do it? What guarantee do you have that any of the pods are still running once you need to access data?
In figure 12.3, we illustrate the challenges you face when you try to gather data from a distributed system. Even if you had some sort of persistence for the log data that would allow it to survive after a running pod is replaced, it’d be no easy task to track down a request through your system. You need a better way to record what’s going on with your system: to fully understand behavior, you need to collect log data from every service and aggregate it in a central place.
Our objective by the end of this chapter will be to have an infrastructure that allows you to collect log data from all your services, aggregating it and allowing you to perform searches in it so you can, at any time, reason through the behavior of the system. You’ll be able to use the available data to audit, debug, or even gather new insights by processing it further. One example of the processing you can do to augment the available information is to collect the IP data stored in the logs and generate a visualization showing the most common geographic areas of your users.
To effectively store and make your log data searchable, you first need to agree on a format the engineering team will use. A consistent format will help to guarantee that you can store and process data effectively.
To be able to achieve observability, you have to collect data from multiple sources: not only from running services but also from your infrastructure. Defining a common format allows you to analyze data more easily and perform searches in that data using existing tools with minimal effort. Examples of the data you may collect and use are application log entries, web server access logs, and events emitted by your infrastructure and platform.
For some components, you can’t control the format, so you have to cope with their specificities and transform them somehow. But for now, let’s focus on what you can control: your services. Making sure the whole engineering team abides by a format and a way of doing things pays off in the long run, because data collection will become simpler and more effective. Let’s start by determining what you should store; then we can look at how to store it.
For your log data to be useful and effective in helping you to understand behavior in your systems, you need to make sure it includes certain information that will allow you to communicate certain things. Let’s look into what you should include as part of each log entry.
To be able to correlate data and order it appropriately, you need to make sure you attach timestamps to your log entries. Timestamps should be as granular and verbose as possible; for example, use four-digit years and the best resolution available. Each service should render its own timestamps, preferably in microseconds. Timestamps also should include a time zone, and it’s advisable that you collect data as GMT/UTC whenever possible.
Having these details allows you to avoid issues with correlating data from different services with different time zones. Ordering data by time of occurrence will become much easier and require less context while analyzing. Getting timestamps right is the first step in making sure you can understand the sequence in which events took place.
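As a minimal sketch of the advice above (the helper name is ours, not from the book), a service can render timestamps in UTC with microsecond resolution and an explicit offset using only the standard library:

```python
from datetime import datetime, timezone

def log_timestamp():
    # UTC timestamp with a four-digit year, microsecond resolution,
    # and an explicit time zone offset,
    # e.g. '2018-02-02T18:42:09.119358+00:00'.
    return datetime.now(timezone.utc).isoformat()
```

Because every service emits UTC with an explicit offset, entries from different hosts can be sorted lexicographically without any per-service time zone context.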
You should use unique identifiers whenever, and as much as, possible in the data you intend to log. Request IDs, user IDs, and other unique identifiers are invaluable when you’re cross-referencing data from multiple sources. They allow you to group data from different sources in an effective way.
Most of the time, these IDs already exist in your system because you need to use them to identify resources. It’s likely they’re already being propagated through different services, so you should make the best use out of them. Unique identifiers used alongside timestamps yield a powerful tool to understand the flow of events in a system.
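A small sketch of the idea (the function and logger names are illustrative, not from the book's code): generate the request ID once, then attach it to every related log entry so the entries can later be grouped across services.

```python
import logging
import uuid

logger = logging.getLogger("orders")

def handle_sell_order():
    # Generate the request ID once; every subsequent entry carries it
    # in the `extra` dict so aggregation tools can correlate them.
    request_id = str(uuid.uuid4())
    logger.info("received sell order", extra={"uuid": request_id})
    logger.info("placed sell order", extra={"uuid": request_id})
    return request_id
```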
Identifying the source of a given log entry allows easier debugging when needed. Typical source data you can use includes the host, the service name, and the module, function, and line number that emitted the entry.
When adding execution times on a given function call, the information you’ve collected for the source allows you to infer performance because you can extrapolate execution times, even if not in real time. Although this isn’t a replacement for collecting metrics, it can be effective in helping to identify bottlenecks and potential performance issues.
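For example, a small decorator (our own sketch, not the book's code) can attach the function name and elapsed time to a log entry, giving you coarse performance data alongside the source information:

```python
import functools
import logging
import time

logger = logging.getLogger("timing")

def log_timing(func):
    # Log the function name and execution time in milliseconds,
    # mirroring the source fields described above.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("call finished",
                        extra={"func_name": func.__name__,
                               "elapsed_ms": elapsed_ms})
    return wrapper

@log_timing
def sell_shares(quantity):
    return quantity * 2
```

Note the extra keys avoid names such as `funcName` that the standard `logging` module reserves on its `LogRecord` objects.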
Each log entry should contain a category. The category can be either the type of data you’re logging or the log level. Typically, the following values are used as log levels: ERROR, DEBUG, INFO, and WARN.
The category will allow you to group data. Some tools can parse log files searching for messages with the ERROR level and communicate them to error reporting systems. This is a perfect example of how you can use the log level or category to automate the process of error reporting without the need for explicit instructions.
You want to generate log entries in a human-readable format, but at the same time they need to be easily parseable by a machine. What we mean by human readable is avoiding binary encoding of data or any type of encoding that your average human can’t understand. An example of this would be storing the binary representation of an image. You should probably use its ID, file size, and other associated data instead.
You also should avoid multiline logs because they can lead to fragmentation while parsing them in log aggregation tools. With such logs, it can be easy to lose some of the information associated with a particular log entry, like the ID, timestamp, or source.
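One way to keep an error report on a single line, sketched here with a hypothetical helper, is to embed the multiline traceback as an escaped string inside a JSON entry:

```python
import json
import traceback

def format_error_entry(message):
    # Embed the multiline traceback as a single escaped JSON string so
    # the whole entry stays on one line for log parsers; the entry's
    # ID, timestamp, and source fields can't get separated from it.
    return json.dumps({
        "message": message,
        "stack_info": traceback.format_exc().strip(),
    })

try:
    1 / 0
except ZeroDivisionError:
    entry = format_error_entry("division failed")
```

The resulting `entry` string contains no raw newlines; aggregation tools see one record, and the traceback is still recoverable by parsing the JSON.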
For the examples in this chapter, you’ll be using JSON to encode your log entries. Doing so allows you to provide human-readable and machine-parseable data, as well as to automatically include some of the data we mentioned in the previous section.
In chapter 7, when we were discussing a microservice chassis, we introduced a Python library that provides log formatting: logstash-formatter. Logstash libraries are available for different languages, so you can expect the format to be widespread and easily usable no matter what language you choose to code your services in.
Let’s now look into a log entry collected using the Logstash library for Python. This message is formatted using V1 of the logstash format, and the application generated it automatically when it was booting, without the need for any specific code instruction to log it:
{
  "source_host" : "e7003378928a", ①
  "pathname" : "/usr/local/lib/python3.6/site-packages/nameko/runners.py",
  "relativeCreated" : 386.46125793457031,
  "levelno" : 20,
  "msecs" : 118.99447441101074, ②
  "process" : 1,
  "args" : [ "orders_service" ],
  "name" : "nameko.runners",
  "filename" : "runners.py", ③
  "funcName" : "start", ③
  "module" : "runners", ③
  "lineno" : 64, ③
  "@timestamp" : "2018-02-02T18:42:09.119Z", ④
  "@version" : 1, ⑤
  "message" : "starting services: orders_service", ⑥
  "levelname" : "INFO", ⑦
  "stack_info" : null,
  "thread" : 140612517945416,
  "processName" : "MainProcess",
  "threadName" : "MainThread",
  "msg" : "starting services: %s",
  "created" : 1520275329.1189945
}
As you can see, the Logstash library inserts relevant information, taking that burden from the developer’s shoulders. In the following listing, you’ll see how an explicit log call in code renders a log entry.
Listing 12.1 Logstash V1 formatted log message after an explicit log instruction
# Python code for generating a log entry
self.logger.info({"message": "Placing sell order", "uuid": res}) ①
{
  "@timestamp": "2018-02-02T18:43:08.221Z",
  "@version": 1,
  "source_host": "b0c90723c58f",
  "name": "root",
  "args": [],
  "levelname": "INFO", ②
  "levelno": 20,
  "pathname": "./app.py",
  "filename": "app.py",
  "module": "app",
  "stack_info": null,
  "lineno": 33,
  "funcName": "sell_shares",
  "created": 1520333830.3000789,
  "msecs": 300.0788688659668,
  "relativeCreated": 15495.944738388062,
  "thread": 140456577504064,
  "threadName": "GreenThread-2",
  "processName": "MainProcess",
  "process": 1,
  "message": "Placing sell order", ③
  "uuid": "a95d17ac-f2b5-4f2c-8e8e-2a3f07c68cf2" ④
}
In your explicit call to the logger, you’ve only stated the desired level and the message to log in the form of key-value pairs containing a message and a UUID. Logstash automatically collected and added all the other information present in the log entry without you having to declare it explicitly.
Now that you’ve set a format for collecting and presenting info, you can move on to creating a basic logging infrastructure. In this section, you’ll be setting up the infrastructure that’ll allow you to collect logs from all the running services and aggregate them. It also will provide you with search and correlation capabilities. The purpose is to have a central access point to all the log data like you already have for metrics. In figure 12.4, we illustrate what you want to achieve after setting up a logging infrastructure.
Once you set up the log aggregation capability, like you did for metrics in chapter 11, you’ll have all services sending both metrics and logs to centralized systems that’ll allow you to improve observability. You’ll be able to observe data about a running system and dig deeper to collect more information in case you need to audit or debug a particular request. You’ll set up a solution commonly called ELK (Elasticsearch, Logstash, and Kibana) and will use a data collector called Fluentd.
You’ll build the logging infrastructure we propose using Elasticsearch, Logstash, and Kibana. Also, you’ll use Fluentd for pushing logs from the apps to your centralized logging solution. Before we get into more details about these technologies, have a look at figure 12.5 to get an overview of what we want to enable you to achieve.
In figure 12.5, you can see how you can collect the logs for multiple instances of the gateway service and forward them to your centralized logging system. We represent multiple instances of the same service, but this will work for any of the services you have running. Services will redirect all the log information to STDOUT (standard output), and an agent running the Fluentd daemon will be responsible for pushing those logs into Elasticsearch.
By following this pattern when deploying any new services, you’ll make sure log data gets collected and indexed and becomes searchable. But before we move to implementation, we’ll take a little time to introduce each of the technologies you’ll be using.
Elasticsearch (www.elastic.co/products/elasticsearch) is a search and analytics engine that stores data centrally. It indexes data, in your case log data, and allows you to perform efficient search and aggregation operations on the data it has stored.
Logstash (www.elastic.co/products/logstash) is a server-side processing pipeline that allows data ingestion from multiple sources and transforms that data prior to sending it to Elasticsearch. In your case, you’ll be using the Logstash formatting and data collection capabilities by taking advantage of client libraries. In this chapter’s earlier examples, you already observed its ability to provide consistent data that you can send to Elasticsearch. But here you won’t be using Logstash to send data; you’ll be using Fluentd instead.
Kibana (www.elastic.co/products/kibana) is a UI for visualizing Elasticsearch data. It’s a tool you can use to query data and explore its associations. In your use case, it’ll operate on log data. You can use Kibana to derive visualizations from gathered data, so it’s more than a search tool. Figure 12.6 shows an example of a dashboard powered by Kibana.
Fluentd (www.fluentd.org) is an open source data collector that you’ll be using to push data from your services to Elasticsearch. You’ll combine the data formatting and collecting capabilities of Logstash and use Fluentd to push that data. One of its advantages is the fact that you can use it as a logging provider for Dockerfiles if you declare it in Docker compose files.
You’ll set up your solution via the Docker compose file, like you already did in chapter 11 to create the metrics collecting and alert infrastructure. You can find all the code used in this chapter on GitHub (http://mng.bz/k191). There, you’ll find the docker-compose.yml file, where you’ll be declaring the new dependencies. The following listing shows the new components added to the compose file.
Listing 12.2 Docker compose file with Elasticsearch, Kibana, and Fluentd containers
version: '2.1'
services:
  gateway:
    container_name: simplebank-gateway
    restart: always
    build: ./gateway
    ports:
      - 5001:5000
    volumes:
      - ./gateway:/usr/src/app
    links:
      - "rabbitmq:simplebank-rabbitmq"
      - "fluentd"
    logging: ①
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
        tag: simplebank.gateway
  (…)
  kibana:
    image: kibana ②
    links:
      - "elasticsearch" ③
    ports:
      - "5601:5601"
  elasticsearch:
    image: elasticsearch ④
    expose:
      - 9200
    ports:
      - "9200:9200"
  fluentd:
    build: ./fluentd ⑤
    volumes:
      - ./fluentd/conf:/fluentd/etc ⑥
    links:
      - "elasticsearch" ⑦
    ports:
      - "24224:24224"
      - "24224:24224/udp"
(…)
Once you’ve added this content to the docker compose file, you’re almost ready to boot your logging infrastructure alongside your services. But first let’s cover the missing tweaks we mentioned previously that’ll allow you to configure Fluentd to your needs. The Dockerfile you use for building Fluentd follows.
Listing 12.3 Fluentd Dockerfile (Fluentd/Dockerfile)
FROM fluent/fluentd:v0.12-debian ①
RUN ["gem", "install", "fluent-plugin-elasticsearch", "--no-rdoc", "--no-ri", "--version", "1.9.2"] ②
Now all you need is to create a configuration file for Fluentd. The following listing shows the config file.
Listing 12.4 Fluentd configuration (fluentd/conf/fluent.conf)
<source> ①
  @type forward ②
  port 24224
  bind 0.0.0.0
</source>
<match *.**> ③
  @type copy ④
  <store>
    @type elasticsearch ④
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    logstash_dateformat %Y%m%d
    include_tag_key true
    type_name access_log
    tag_key @log_name
    flush_interval 1s
  </store>
  <store>
    @type stdout ④
  </store>
</match>
The match section in the Fluentd configuration file contains all the configuration needed to connect to Elasticsearch, including the host and port, as well as the format to use. You’re using the Logstash format, as you may recall.
With all the needed setup done, you’re now ready to boot your services using the new Docker compose file. Before doing so, let’s go through your services and change code to enable you to send data to your centralized logging infrastructure. In the next section, you’ll configure your services to make use of the Logstash logger. You’ll also set log levels.
In your services, you can control the log level via environment variables, so you can have different levels for development and production environments. Using different log levels allows you to enable more verbose logs in production, in case you need to investigate any issue.
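A minimal sketch of that pattern (the helper is ours, but LOG_LEVEL is the variable used in the Docker compose file): read the level name from the environment, falling back to INFO for unset or unknown values.

```python
import logging
import os

def level_from_env(default="INFO"):
    # Read the level name from the LOG_LEVEL environment variable,
    # falling back to the default for unknown names, so each
    # environment can tune verbosity without a code change.
    name = os.environ.get("LOG_LEVEL", default)
    return getattr(logging, name, logging.INFO)

logging.basicConfig(level=level_from_env())
```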
Let’s look at your gateway service for its logging configuration and also at the service code to understand how you can emit log messages. The logging configuration is shown in the following listing.
Listing 12.5 Configuration file for the gateway service (gateway/config.yml)
AMQP_URI: amqp://${RABBIT_USER:guest}:${RABBIT_PASSWORD:guest}@${RABBIT_HOST:localhost}:${RABBIT_PORT:5672}/
WEB_SERVER_ADDRESS: '0.0.0.0:5000'
RPC_EXCHANGE: 'simplebank-rpc'
LOGGING: ①
  version: 1
  handlers:
    console:
      class: logging.StreamHandler ②
  root:
    level: ${LOG_LEVEL:INFO} ③
    handlers: [console] ④
This configuration will allow setting the log level when the application boots. In the Docker compose file, you’ve set the environment variable LOG_LEVEL to INFO for all services except for the gateway, which has a DEBUG value. Let’s now look into the gateway code to set up logging, as shown in the following listing.
Listing 12.6 Enable logging in the gateway service (gateway/app.py)
import datetime
import json
import logging ①
import uuid

from logstash_formatter import LogstashFormatterV1 ②
from nameko.rpc import RpcProxy, rpc
from nameko.web.handlers import http
from statsd import StatsClient
from werkzeug.wrappers import Request, Response


class Gateway:
    name = "gateway"
    orders = RpcProxy("orders_service")
    statsd = StatsClient('statsd', 8125,
                         prefix='simplebank-demo.gateway')

    logger = logging.getLogger() ③
    handler = logging.StreamHandler() ③
    formatter = LogstashFormatterV1() ③
    handler.setFormatter(formatter) ③
    logger.addHandler(handler) ③

    @http('POST', '/shares/sell')
    @statsd.timer('sell_shares')
    def sell_shares(self, request):
        req_id = uuid.uuid4()
        res = u"{}".format(req_id)
        self.logger.debug(
            "this is a debug message from gateway",
            extra={"uuid": res}) ④
        self.logger.info("placing sell order",
                         extra={"uuid": res}) ⑤
        self.__sell_shares(res)
        return Response(json.dumps(
            {"ok": "sell order {} placed".format(req_id)}),
            mimetype='application/json')

    @rpc
    def __sell_shares(self, uuid):
        self.logger.info("contacting orders service",
                         extra={"uuid": uuid}) ⑤
        res = u"{}".format(uuid)
        return self.orders.sell_shares(res)

    @http('GET', '/health')
    @statsd.timer('health')
    def health(self, _request):
        return json.dumps({'ok': datetime.datetime.utcnow().__str__()})
In listing 12.2, you saw how to enable the Fluentd driver for logging with Docker. This means you’re ready to send log data generated from your services to Elasticsearch using Fluentd, and afterwards you’ll be able to explore that log data using Kibana. To start all the services, metrics, and logging infrastructure from a console in the root directory, execute the following command:
docker-compose up --build --remove-orphans
Once all is ready, you need to complete one last step: configuring Kibana to use the logs collected via Fluentd and stored in Elasticsearch. To do so, access the Kibana web dashboard (http://localhost:5601). On the first access, you’ll be redirected to the management page, where you’ll need to configure an index pattern to tell Kibana where to find your data. If you recall, in the Fluentd configuration, you set the logstash prefix option to the value fluentd. This is what you need to enter in the index text box presented to you. Figure 12.7 shows the Kibana dashboard management section and the value you need to input.

After inserting fluentd-* as the index pattern and clicking the Create button, you’ll be ready to explore all the log data that your multiple services create. Fluentd will forward all data to a central location, and you’ll be able to access it in a convenient way.
To generate some log data, all you need to do is create a sell request to your gateway service. To do so, you issue a POST request to the gateway. The following shows the request via curl, but any tool capable of generating a POST request will do:
chapter-12$ curl -X POST http://localhost:5001/shares/sell \
    -H 'cache-control: no-cache' \
    -H 'content-type: application/json' ①
{"ok": "sell order e11f4713-8bd8-4882-b645-55f96d220e44 placed"} ②
Now that you have log data collected, you can explore it using Kibana. Clicking the Discover section on the left side of Kibana’s web dashboard will take you to a page where you can perform searches. In the search box, insert the request UUID you received as the sell order response. In the case of this example, you’d use e11f4713-8bd8-4882-b645-55f96d220e44 as your search parameter. In the next section, we’ll show you how to use Kibana to follow the execution of a sell request through the different services involved.
In the code example for this chapter, you have five independent services collaborating to allow SimpleBank users to sell shares. All services are logging their operations and using the request UUID as a unique identifier to allow you to aggregate all log messages referring to a sell order that’s processing. You can make use of Kibana to explore your logs and track down the execution of a request. In figure 12.8, you use the order ID that the gateway service returns to perform the search.
When you use the request ID as the search parameter, Kibana filters the log data, and you get 11 hits that allow you to follow the execution of a request through different services. Kibana also lets you write complex queries to uncover further insights. You can filter per service, sort by time, and even use data present in the log entries, such as execution times, to create dashboards that track performance. That use is beyond the scope of this chapter, but feel free to explore the possibilities to get new perspectives on the collected data.
We’ll now zoom in a bit and focus on some of the log entries that your query shows. Figure 12.9 shows some of those entries with a bit more detail.
In figure 12.9, you’ll find messages from the market and gateway services. For the latter, given that the log level selected was DEBUG, you’ll find both info and debug messages. As mentioned previously, by using the logstash-formatter library in your Python code, you get more information for free. You can find data regarding execution times and scoping of the execution by module, function, line, process, and thread. All of this info can be useful if you need to diagnose any issues in the future.
Now that you’re able to collect logs and store them, you need to be careful about what information you send via logs. Things like passwords, credit card numbers, ID card numbers, and potentially sensitive personal data that get sent will be stored and accessible by anyone who can use the logging infrastructure. In your case, you’re hosting and controlling the logging infrastructure, but if you were using a third-party provider, you’d need to pay extra attention to these details. You have no easy way to delete only some of the data already sent. In most cases, if you want to delete something specific, it means deleting all the log data for a given period of time.
Data privacy is a hot topic at the moment, and with the EU General Data Protection Regulation (GDPR) (www.eugdpr.org) now in effect, you need to take extra care when considering which data to log and how to log it. We won’t explore the needed steps in depth here, but both Fluentd and Elasticsearch allow you to apply filtering to data so that any sensitive fields get masked, encrypted, or removed from the data that they receive and that Elasticsearch indexes. The general rule would be to log as little information as you can, avoid any personal data in logs, and take extra care with reviewing what gets logged before any changes make it into the production environment. Once your services send data, it’s hard to erase it and doing so will have associated costs.
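A simple precaution is to scrub entries in the service itself, before they ever reach the logging pipeline. The field names below are hypothetical, and this is a sketch rather than a complete redaction scheme:

```python
# Hypothetical list of sensitive field names for illustration.
SENSITIVE_KEYS = {"password", "credit_card", "id_card"}

def scrub(entry):
    # Mask sensitive fields before an entry leaves the service; once
    # data has been shipped and indexed, deleting it selectively is
    # hard and costly.
    return {key: "***" if key in SENSITIVE_KEYS else value
            for key, value in entry.items()}
```

For example, `scrub({"message": "user login", "password": "hunter2"})` masks only the password field and leaves the rest of the entry intact.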
That said, you can and should use logs to communicate useful information to allow you to understand system behavior. Sending IDs that allow you to correlate actions of different systems and terse log messages indicating actions performed in or by systems can help you keep track of what’s happened.
When you were setting up your log infrastructure and making the code changes to emit log messages, you already took care of propagating an ID field that allows you to follow the execution path of a request through your system. With this set up, you can group log entries under the same context. You may even use log data to create visualizations that’ll allow you to understand how much time each component took to process a request. In addition, you can use it to help you identify bottlenecks and places where you can improve your code to have extra performance. But logs aren’t the only tool available — you have another method at your disposal for doing this that doesn’t rely on log data.
You can do better by reconstructing the journey of requests through your microservices. In this section, you’ll be setting up distributed tracing to allow you to visualize the flow of execution between services and at the same time, provide insights on how long each operation takes. This can be valuable, not only to understand the order in which a request flows through multiple services, but also to identify possible bottlenecks. For this, you’ll use Jaeger and libraries compatible with the OpenTracing API (http://opentracing.io).
A trace is a directed acyclic graph (DAG) of one or more spans, where the edges between those spans are called references. Traces are used to aggregate and correlate execution flow through the system. To do so, some information needs to be propagated. A trace captures the whole flow.
Let’s look at figures 12.10 and 12.11. In these figures, we represent a trace made up of multiple spans, from both a dependency perspective and a temporal perspective.
In figure 12.10, you can observe the dependency relationship between different spans. These spans can be triggered either in the same application or in different ones. The only requirement is for the parent span ID to be propagated, so when a new span is triggered, it’ll hold a reference to its parent span.
In figure 12.11, you have a view of spans from a temporal perspective. By using temporal information contained in spans, you can organize them in a timeline. You can see not only when each span happened relative to other spans but also how long each operation that a span encapsulates took to complete.
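Because each span carries its own start and finish timestamps, a collector can both order spans on a timeline and derive per-operation durations. A tiny illustration with made-up span data:

```python
# Made-up span timing data (seconds from the start of the trace),
# for illustration only.
spans = [
    {"name": "gateway", "start": 0.000, "finish": 0.180},
    {"name": "orders",  "start": 0.015, "finish": 0.120},
    {"name": "market",  "start": 0.040, "finish": 0.110},
]

# Order spans into a timeline and compute how long each operation took.
timeline = [s["name"] for s in sorted(spans, key=lambda s: s["start"])]
durations = {s["name"]: round(s["finish"] - s["start"], 3) for s in spans}
```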
Each span contains an operation name, a start timestamp and duration, a set of key-value tags and logs, and a span context.
The span context contains the needed data to refer to different spans, either locally or across service boundaries.
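To make the cross-boundary part concrete, here's a deliberately simplified toy sketch of context propagation. The header names and dictionary shape are our own invention, not Jaeger's wire format or the OpenTracing API: the caller injects the span context into HTTP headers, and the callee extracts it to start a child span in the same trace.

```python
import uuid

def inject(span_ctx, headers):
    # Client side: serialize the span context into the outgoing headers.
    headers["x-trace-id"] = span_ctx["trace_id"]
    headers["x-parent-span-id"] = span_ctx["span_id"]

def extract(headers):
    # Server side: rebuild the span context from the incoming headers.
    return {"trace_id": headers["x-trace-id"],
            "span_id": headers["x-parent-span-id"]}

def start_child_span(parent_ctx):
    # A child span keeps the trace ID and records its parent span.
    return {"trace_id": parent_ctx["trace_id"],
            "span_id": uuid.uuid4().hex,
            "parent_id": parent_ctx["span_id"]}

# The calling service injects before the HTTP call...
client_ctx = {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex}
headers = {}
inject(client_ctx, headers)
# ...and the called service extracts on arrival.
child = start_child_span(extract(headers))
```

The real libraries do exactly this dance for you, which is what you'll see when the profile service injects headers for the settings service later in the chapter.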
Let’s now move on to setting up tracing between services. You’ll be using Jaeger (www.jaegertracing.io), a distributed tracing system, as well as a set of Python libraries that are OpenTracing compatible.
To be able to display tracing information and correlate requests between different services, you’ll need to set up a collector and a UI for traces, as well as including some libraries and setting them up in your services. The services we’ll use as an example for distributed tracing will be the SimpleBank profile and settings services. In figure 12.12, we give an overview of the interactions you’ll be tracing.
The profile service will contact an external service, in this case jsontest.com, to retrieve its IP and will also be fetching user settings from the settings service. You’ll be setting up a tracing system (Jaeger) and making the code changes needed to display the trace and its spans and to be able to correlate those spans. Correlating them will allow you to understand in detail how long each operation took and how it contributed to the overall execution time of a call to the profile service. You’ll begin by setting up Jaeger, the distributed tracing system collector and UI, by adding a Docker image into your docker-compose.yml file (listing 12.7).
Listing 12.7 Adding Jaeger to the docker-compose.yml file
(…)
  jaeger:
    container_name: jaeger
    image: jaegertracing/all-in-one:latest ①
    ports:
      - 5775:5775/udp ②
      - 6831:6831/udp
      - 6832:6832/udp
      - 5778:5778
      - 16686:16686 ③
      - 14268:14268
      - 9411:9411 ④
    environment:
      COLLECTOR_ZIPKIN_HTTP_PORT: "9411" ④
With the Docker image added to your docker-compose file, once you boot all the SimpleBank infrastructure, you’ll have a distributed tracing system in place. Now you need to make sure that the SimpleBank profile and settings services are able to create traces and spans and communicate them to Jaeger.
Let’s add the needed libraries to both the settings and profile services and initialize the tracer. The following listing adds the tracing libraries.
Listing 12.8 Adding the tracing libraries to the services via a requirements.txt file
Flask==0.12.0
requests==2.18.4
jaeger-client==3.7.1 ①
opentracing>=1.2,<2 ②
opentracing_instrumentation>=2.2,<3 ③
By adding these libraries, you’re now able to create traces and spans from both services. To make the process easier, you can also create a module to provide a convenient setup function to initialize the tracer, as shown in the following listing.
Listing 12.9 Tracer initializer lib/tracing.py
import logging

from jaeger_client import Config ①


def init_tracer(service): ②
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': { ③
                'reporting_host': "jaeger",
                'reporting_port': 5775,
            },
            'logging': True, ④
            'reporter_batch_size': 1,
        },
        service_name=service, ⑤
    )
    return config.initialize_tracer()
The SimpleBank profile and settings services will both be using the tracer initialization function shown in listing 12.9. This allows them to establish the connection to Jaeger. As we showed in figure 12.12, the profile service contacts both an external service and the settings internal service. You’ll be tracing the interaction of the profile service with both of these collaborators. In the case of the interaction with the SimpleBank settings service, you’ll need to pass along the context of the initial trace so you can visualize the full cycle of a request.
Listing 12.10 shows the profile service code where you set up the spans for both the external HTTP service and the settings service. For the former, you create a span, and for the latter, you pass along the current span as a header so the settings service can make use of it and create child spans.
Listing 12.10 Profile service code
from urlparse import urljoin

import opentracing
import requests
from flask import Flask, jsonify, request
from opentracing.ext import tags ①
from opentracing.propagation import Format ①
from opentracing_instrumentation.request_context import get_current_span ①
from opentracing_instrumentation.request_context import span_in_context ①

from lib.tracing import init_tracer ②

app = Flask(__name__)
tracer = init_tracer('simplebank-profile') ③

@app.route('/profile/<uuid:uuid>')
def profile(uuid):
    with tracer.start_span('settings') as span: ④
        span.set_tag('uuid', uuid)
        with span_in_context(span):
            ip = get_ip(uuid)
            settings = get_user_settings(uuid)
            return jsonify({'ip': ip, 'settings': settings})

def get_ip(uuid):
    with tracer.start_span('get_ip', child_of=get_current_span()) as span: ⑤
        span.set_tag('uuid', uuid)
        with span_in_context(span): ⑥
            jsontest_url = "http://ip.jsontest.com/"
            r = requests.get(jsontest_url)
            return r.json()

def get_user_settings(uuid):
    settings_url = urljoin("http://settings:5000/settings/", "{}".format(uuid))
    span = get_current_span() ⑦
    span.set_tag(tags.HTTP_METHOD, 'GET') ⑧
    span.set_tag(tags.HTTP_URL, settings_url) ⑧
    span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT) ⑧
    span.set_tag('uuid', uuid) ⑧
    headers = {}
    tracer.inject(span, Format.HTTP_HEADERS, headers) ⑧
    r = requests.get(settings_url, headers=headers)
    return r.json()

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)
The SimpleBank profile service initializes a trace that’ll be used to group different spans. It creates spans for the calls to "http://ip.jsontest.com/" and for the call to the SimpleBank settings service. For the former, given that you don’t own the service, you execute the call wrapped in a span. But for the latter, because you control it, you can pass on span information that’ll be used to create child spans. This will allow you to group all related calls in Jaeger.
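To make the inject/extract mechanics concrete without any tracing library, here’s a dependency-free toy sketch of how a span context rides along in HTTP headers. The header name and the two functions are hypothetical stand-ins — the real work is done by tracer.inject and tracer.extract in the listings — but the flow is the same:

```python
import uuid

# Hypothetical header name for illustration; Jaeger uses its own header format.
TRACE_HEADER = 'x-toy-trace-id'

def inject(trace_id, headers):
    """Client side: copy the current trace id into the outgoing headers."""
    headers[TRACE_HEADER] = trace_id

def extract(headers):
    """Server side: recover the caller's trace id, or start a fresh trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

# The client (think: profile service) starts a trace and injects it...
outgoing = {}
trace_id = uuid.uuid4().hex
inject(trace_id, outgoing)

# ...and the server (think: settings service) extracts it, so any span it
# creates can be parented on the caller's trace.
parent = extract(outgoing)
assert parent == trace_id  # both services now share one trace id
```

When the header is absent — say, a request arriving from outside the traced system — extract simply begins a new trace, which mirrors how an untraced caller still produces a valid root span on the server side.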
Let’s now look into how you can make use of the injected span in the SimpleBank settings service, as shown in the following listing.
Listing 12.11 Using a parent span in the settings service
import time
from random import randint

import requests
from flask import Flask, jsonify, request
from opentracing.ext import tags
from opentracing.propagation import Format
from opentracing_instrumentation.request_context import get_current_span
from opentracing_instrumentation.request_context import span_in_context

from lib.tracing import init_tracer

app = Flask(__name__)
tracer = init_tracer('simplebank-settings') ①

@app.route('/settings/<uuid:uuid>')
def settings(uuid):
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers) ②
    span_tags = {tags.SPAN_KIND: tags.SPAN_KIND_RPC_SERVER, 'uuid': uuid} ③
    with tracer.start_span('settings', child_of=span_ctx, tags=span_tags): ④
        time.sleep(randint(0, 2))
        return jsonify({'settings': {'name': 'demo user', 'uuid': uuid}})

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)
By extracting the span context from the request the settings service receives, you can then make use of it as the parent to new spans. You can later visualize this new child span independently. But you’ll also be able to take advantage of the fact that Jaeger will show the span as both an independent span in the context of the SimpleBank settings service and a child span in the context of the SimpleBank profile service.
With all the setup out of the way, all you need to do to start collecting traces is to issue a request to the SimpleBank profile endpoint. You can use the command line or a browser. To access traces via the command line, you can use curl to issue the following request:
$ curl http://localhost:5007/profile/26bc34c2-5959-4679-9d4d-491be0f3c0c0
{
  "ip": {
    "ip": "178.166.53.17"
  },
  "settings": {
    "settings": {
      "name": "demo user",
      "uuid": "26bc34c2-5959-4679-9d4d-491be0f3c0c0"
    }
  }
}
Here’s a brief recap of what’s going on when you hit the profile endpoint:
- The profile service starts a trace and opens a root span for the request.
- It calls the external http://ip.jsontest.com/ service, wrapping that call in a child span.
- It calls the settings service, injecting the current span context into the request’s HTTP headers.
- The settings service extracts that context and creates its own child span before returning the user’s settings.
To visualize the traces, you need to access the Jaeger UI that’ll be running on port 16686. Figure 12.13 shows the Jaeger UI and the list of services that have traces available.
In the Service section, you see three services for which trace information is available: two SimpleBank services and one called jaeger-query. The latter gathers Jaeger internal traces and is of little use to you. You’re interested in the other two services listed: simplebank-profile and simplebank-settings. If you recall, the profile service was creating spans for the execution of an external call, as well as for the call to the settings service. Go ahead and select simplebank-profile and click Find Traces at the bottom. Figure 12.14 shows the traces for the profile service.
The page lists six traces, and all of them have three spans across two services. This means you were able to collect information about the collaboration between two internal services and to get timing information about the execution. Figure 12.15 shows a detailed view of one of those traces.
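To illustrate the kind of timing arithmetic the trace detail view performs, here’s a toy calculation over a made-up trace (the span names and millisecond values are assumptions for illustration, not data from Jaeger): given each span’s start offset and duration, you can derive the overall execution time and lay the spans out on a shared timeline.

```python
# Hypothetical spans: (name, start offset in ms since trace start, duration in ms)
spans = [
    ('profile root span', 0, 220),
    ('get_ip (external call)', 5, 90),
    ('settings (child span)', 100, 110),
]

# The trace runs from the earliest span start to the latest span end.
trace_start = min(start for _, start, _ in spans)
trace_end = max(start + dur for _, start, dur in spans)
total = trace_end - trace_start  # overall execution time shown in the UI

# Print each span's position on the timeline, the way the trace view draws it.
for name, start, dur in spans:
    offset = start - trace_start
    print('%-24s +%3dms for %3dms' % (name, offset, dur))

print('total: %dms' % total)
```

Reading the offsets side by side makes overlap (or its absence) obvious: in this made-up trace the two child calls run one after the other, which is exactly the sequential pattern you’d want to spot before deciding whether those calls could be made concurrently.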
In figure 12.15, you can see a timeline for the different steps of execution in a call to the profile service. You have information about the overall time of execution, as well as when each suboperation took place and for how long. The spans contain information about the operation, the component that generated them, and their execution times and relative positions, both in the timeline and in regards to dependencies with parent spans.
This information can be invaluable in order to know what’s going on in a distributed system. You can now visualize the flow of requests through different services and know how long each operation takes to complete. This simple setup allows you to both understand the flow of execution in a microservice architecture and identify potential bottlenecks that you can improve.
You also can use Jaeger to understand how different components in your system relate to each other. The top navigation menu bar has a Dependencies link. By clicking it and then, in the page that comes up, selecting the DAG (directed acyclic graph) tab, you have access to the view illustrated in figure 12.16.
The example we used was a simple one, but it allows you to understand the power of tracing in a microservice architecture. Along with logging and metrics, it allows you to have an informed view of both the performance and the behavior of your system.