12
Using logs and traces to understand behavior

This chapter covers

  • Storing logs in a consistent and structured way in a machine-readable format
  • Setting up a logging infrastructure
  • Using traces and correlation IDs to understand system behavior

In the previous chapter, we focused on emitting metrics from your services and using those metrics to create dashboards and alerts. Metrics and alerts are only one part of what you need to achieve observability in your microservice architecture. In this chapter, we’ll focus on collecting logs and making sure you’re able to trace the interactions between services. This will allow you not only to have an overview of how the system behaves but also to go back in time and retrospectively follow each request. Doing so is important for debugging errors and identifying bottlenecks. Logs give you a sort of paper trail that documents the history of each request entering your system, whereas traces provide a way to establish a timeline for each request, so you can understand how much time it spent in different services.

By the end of this chapter, you’ll have created a basic logging infrastructure and set up the tracing capability. You’ll be able to both monitor the operation of your application and have the tools to audit and investigate in case you need to do so for particular requests. In addition, you’ll be able to identify performance issues by looking into tracing data.

12.1 Understanding behavior across services

In a microservices-based architecture, multiple services will be involved in providing functionality to users. It gets hard to understand what goes on with every request when you no longer have a central access point to data. Services are distributed across multiple nodes, are ephemeral, and are continuously being deployed and scaled to meet the needs of operation. Let’s revisit the sell order use case as you might have implemented it if you’d needed a single application running on a single machine (figure 12.1).

c12_01.png

Figure 12.1 The sell order use case implemented in a single application

In figure 12.1, we’ve represented each of the services that collaborate to allow a client to sell shares as modules in the same application. If you were to inspect the lifecycle of a given request in the system, you could log in to the machine and inspect log data stored on a hard drive. But you’d most likely have multiple machines running the application, for redundancy and availability, so things wouldn’t be as easy as logging in to one machine. Once you identified the request you were interested in observing, you’d have to identify which machine had handled it and then inspect that machine. Going through its logs would give you the insights you needed.

Not that maintaining logs on a single machine is easy; a server can also crash and become unavailable. Our aim here isn’t to talk about minimizing the complexity of keeping log data (or any data) safely persisted but to point out that having a single point for storing all the data makes it easier and more convenient to consult.

Let’s now compare the same scenario in a microservices application. Figure 12.2 illustrates the same use case with multiple services, each with multiple copies of itself, running independently.

As you can see in figure 12.2, you have five services running independently, and each of those services has three instances running. This means potentially none of the pods are executing on the same physical machine. A request coming into the system will most likely flow through multiple pods running on different physical machines, and you have no easy way to track down that request by accessing logs. Could you even do it? What guarantee do you have that any of the pods are still running once you need to access the data?

c12_02.png

Figure 12.2 The sell order use case in SimpleBank with multiple services running, each with its own log data

c12_03.png

Figure 12.3 Accessing logs for each service in different running instances is challenging.

In figure 12.3, we illustrate the challenges you face when you try to gather data from a distributed system. Even if you had some sort of persistence for the log data that allowed it to survive after a running pod is replaced, it’d be no easy task to track down a request through your system. You need a better way to record what’s going on with your system. To be able to fully understand behavior, you need to

  • Make sure you persist log data so it survives through service restarts and scaling
  • Aggregate all log data from multiple services and instances of those services in a central location
  • Make the stored data usable, allowing for easy searches and further processing

Our objective for this chapter is an infrastructure that collects log data from all your services, aggregates it, and lets you search it, so you can reason about the behavior of the system at any time. You’ll be able to use the available data to audit, debug, or even gather new insights by processing it further. One example of such processing is to collect the IP data stored in the logs and generate a visualization showing the most common geographic areas of your users.

To effectively store and make your log data searchable, you first need to agree on a format the engineering team will use. A consistent format will help to guarantee that you can store and process data effectively.

12.2 Generating consistent, structured, human-readable logs

To be able to achieve observability, you have to collect data from multiple sources, not only from running services but also from your infrastructure. Defining a common format allows you to analyze data more easily and perform searches in that data using existing tools with minimal effort. Examples of the data you may collect and use are:

  • Application logs
  • Database logs
  • Network logs
  • Performance data collected from the underlying operating system

For some components, you can’t control the format, so you have to cope with their specificities and transform them somehow. But for now, let’s focus on what you can control: your services. Making sure the whole engineering team abides by a format and a way of doing things pays off in the long run, because data collection will become simpler and more effective. Let’s start by determining what you should store; then we can look at how to store it.

12.2.1 Useful information to include in log entries

For your log data to be useful and effective in helping you understand behavior in your systems, you need to make sure each entry includes information that establishes context: when something happened, where it happened, and what it relates to. Let’s look into what you should include as part of each log entry.

Timestamps

To be able to correlate data and order it appropriately, you need to make sure you attach timestamps to your log entries. Timestamps should be as granular and verbose as possible; for example, use four-digit years and the best resolution available. Each service should render its own timestamps, preferably in microseconds. Timestamps also should include a time zone, and it’s advisable that you collect data as GMT/UTC whenever possible.

Having these details allows you to avoid issues with correlating data from different services with different time zones. Ordering data by time of occurrence will become much easier and require less context while analyzing. Getting timestamps right is the first step in making sure you can understand the sequence in which events took place.
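
As a minimal sketch of what that looks like in Python (not tied to any SimpleBank service), the standard library can render a UTC timestamp with microsecond resolution and an explicit offset:

from datetime import datetime, timezone

def utc_timestamp():
    """ISO-8601 timestamp in UTC with microsecond resolution and offset."""
    return datetime.now(timezone.utc).isoformat()

print(utc_timestamp())  # for example, 2018-02-02T18:42:09.119873+00:00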

Identifiers

You should use unique identifiers whenever, and as much as, possible in the data you intend to log. Request IDs, user IDs, and other unique identifiers are invaluable when you’re cross-referencing data from multiple sources. They allow you to group data from different sources in an effective way.

Most of the time, these IDs already exist in your system because you need to use them to identify resources. It’s likely they’re already being propagated through different services, so you should make the best use out of them. Unique identifiers used alongside timestamps yield a powerful tool to understand the flow of events in a system.
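
A minimal sketch of that idea, using only the standard library (the logger and field names here are hypothetical, not taken from the SimpleBank services), is to bind a request ID to a logger once and have every subsequent entry carry it:

import logging
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(request_id)s %(message)s")
logger = logging.getLogger("orders")

def handle_request():
    # Bind the request ID once; every entry made through `log` carries it.
    request_id = str(uuid.uuid4())
    log = logging.LoggerAdapter(logger, {"request_id": request_id})
    log.info("placing sell order")
    log.info("contacting orders service")

handle_request()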

Source

Identifying the source of a given log entry allows easier debugging when needed. Typical source data you can use includes:

  • Host
  • Class or module
  • Function
  • Filename

When you also log the execution time of a given function call, the source information lets you reason about performance after the fact, even if not in real time. Although this isn’t a replacement for collecting metrics, it can be effective in helping to identify bottlenecks and potential performance issues.
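
For example, a small decorator (a sketch, with made-up names, rather than code from the chapter’s repository) can attach execution times to log entries, while the logging module records the module, filename, and line number of the logging call:

import functools
import logging
import time

logger = logging.getLogger(__name__)

def log_timing(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # funcName/lineno in the record point at this wrapper, so the
            # wrapped function's name is included explicitly in the message.
            logger.info("executed %s", func.__name__,
                        extra={"elapsed_ms": round(elapsed_ms, 3)})
    return wrapper

@log_timing
def sell_shares():
    time.sleep(0.05)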

Level or category

Each log entry should contain a category. The category can be either the type of data you’re logging or the log level. Typically, the following values are used as log levels: ERROR, DEBUG, INFO, WARN.

The category will allow you to group data. Some tools can parse log files searching for messages with the ERROR level and communicate them to error reporting systems. This is a perfect example of how you can use the log level or category to automate the process of error reporting without the need for explicit instructions.
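
As a sketch of how the level can drive that kind of automation inside a service (the handler targets here are illustrative), you can attach a handler that only reacts to ERROR and above while everything else still flows to the regular output:

import logging
import sys

logger = logging.getLogger("orders")
logger.setLevel(logging.DEBUG)

# Everything goes to standard output for the log collector to pick up.
all_handler = logging.StreamHandler(sys.stdout)
all_handler.setLevel(logging.DEBUG)
logger.addHandler(all_handler)

# ERROR and above also go to a dedicated handler; an error-reporting
# integration could hook in at this point instead of a file.
error_handler = logging.FileHandler("errors.log")
error_handler.setLevel(logging.ERROR)
logger.addHandler(error_handler)

logger.info("placing sell order")       # regular output only
logger.error("order placement failed")  # regular output and errors.log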

12.2.2 Structure and readability

You want to generate log entries in a human-readable format, but at the same time they need to be easily parseable by a machine. What we mean by human readable is avoiding binary encoding of data or any type of encoding that your average human can’t understand. An example of this would be storing the binary representation of an image. You should probably use its ID, file size, and other associated data instead.

You also should avoid multiline logs because they can lead to fragmentation while parsing them in log aggregation tools. With such logs, it can be easy to lose some of the information associated with a particular log entry, like the ID, timestamp, or source.

For the examples in this chapter, you’ll be using JSON to encode your log entries. Doing so allows you to provide human-readable and machine-parseable data, as well as to automatically include some of the data we mentioned in the previous section.

In chapter 7, when we were discussing a microservice chassis, we introduced a Python library that provides log formatting: logstash-formatter. Logstash libraries are available for different languages, so you can expect the format to be widespread and easily usable no matter what language you choose to code your services in.

Let’s now look into a log entry produced using the logstash-formatter library for Python. This message is formatted using V1 of the Logstash format, and the application generated it automatically when it was booting, without the need for any specific code instruction to log it:

{ 
    "source_host" : "e7003378928a",    ①  
    "pathname" : "usrlocallibpython3.6site-packagesnamekorunners.py", 
➥"relativeCreated" : 386.46125793457031,  
    "levelno" : 20,  
    "msecs" : 118.99447441101074,    ②   
    "process" : 1,  
    "args" : [    "orders_service"  ],
    "name" : "nameko.runners",  
    "filename" : "runners.py",    ③  
    "funcName" : "start",    ③  
    "module" : "runners",    ③  
    "lineno" : 64,    ③  
    "@timestamp" : "2018-02-02T18:42:09.119Z",    ④  
    "@version" : 1,    ⑤  
    "message" : "starting services: orders_service",    ⑥  
    "levelname" : "INFO",    ⑦  
    "stack_info" : null,  
    "thread" : 140612517945416,   
    "processName" : "MainProcess",  
    "threadName" : "MainThread",  
    "msg" : "starting services: %s",  
    "created" : 1520275329.1189945
}

As you can see, the Logstash library inserts relevant information, taking that burden from the developer’s shoulders. In the following listing, you’ll see how an explicit log call in code renders a log entry.

Listing 12.1 Logstash V1 formatted log message after an explicit log instruction

# Python code for generating a log entry
self.logger.info({"message": "Placing sell order", "uuid": res})    ①  


{
    "@timestamp": "2018-02-02T18:43:08.221Z",
    "@version": 1,
    "source_host": "b0c90723c58f",
    "name": "root",
    "args": [],
    "levelname": "INFO",    ②  
    "levelno": 20,
    "pathname": "./app.py",
    "filename": "app.py",
    "module": "app",
    "stack_info": null,
    "lineno": 33,
    "funcName": "sell_shares",
    "created": 1520333830.3000789,
    "msecs": 300.0788688659668,
    "relativeCreated": 15495.944738388062,
    "thread": 140456577504064,
    "threadName": "GreenThread-2",
    "processName": "MainProcess",
    "process": 1,
    "message": "Placing sell order",    ③  
    "uuid": "a95d17ac-f2b5-4f2c-8e8e-2a3f07c68cf2"    ④  
}

In your explicit call to the logger, you only stated the desired level and passed the message to log in the form of key-value pairs containing a message and a UUID. Logstash automatically collected and added all the other information present in the log entry without you having to declare it explicitly.
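
If you want to reproduce that output outside the SimpleBank services, a minimal standalone sketch using the same logstash-formatter library looks like this (the logger setup mirrors the one you’ll see in listing 12.6):

import logging
import sys
import uuid

from logstash_formatter import LogstashFormatterV1

logger = logging.getLogger()
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(LogstashFormatterV1())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

res = str(uuid.uuid4())
logger.info({"message": "Placing sell order", "uuid": res})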

12.3 Setting up a logging infrastructure for SimpleBank

Now that you’ve set a format for collecting and presenting info, you can move on to creating a basic logging infrastructure. In this section, you’ll be setting up the infrastructure that’ll allow you to collect logs from all the running services and aggregate them. It also will provide you with search and correlation capabilities. The purpose is to have a central access point to all the log data like you already have for metrics. In figure 12.4, we illustrate what you want to achieve after setting up a logging infrastructure.

c12_04.png

Figure 12.4 Services making use of centralized metrics and a centralized log for easy access to data

Once you set up the log aggregation capability, like you did for metrics in chapter 11, you’ll have all services sending both metrics and logs to centralized systems that’ll allow you to improve observability. You’ll be able to observe data about a running system and dig deeper to collect more information in case you need to audit or debug a particular request. You’ll set up a solution commonly called ELK (Elasticsearch, Logstash, and Kibana) and will use a data collector called Fluentd.

12.3.1 ELK- and Fluentd-based solution

You’ll build the logging infrastructure we propose using Elasticsearch, Logstash, and Kibana. Also, you’ll use Fluentd for pushing logs from the apps to your centralized logging solution. Before we get into more details about these technologies, have a look at figure 12.5 to get an overview of what we want to enable you to achieve.

In figure 12.5, you can see how you can collect the logs for multiple instances of the gateway service and forward them to your centralized logging system. We represent multiple instances of the same service, but this will work for any of the services you have running. Services will redirect all the log information to STDOUT (standard output), and an agent running the Fluentd daemon will be responsible for pushing those logs into Elasticsearch.

c12_05.png

Figure 12.5 Collecting logs from multiple service instances and forwarding them to a centralized location

By following this pattern when deploying any new services, you’ll make sure log data gets collected and indexed and becomes searchable. But before we move to implementation, we’ll take a little time to introduce each of the technologies you’ll be using.

Elasticsearch

Elasticsearch (www.elastic.co/products/elasticsearch) is a search and analytics engine that stores data centrally. It indexes data, in your case log data, and allows you to perform efficient search and aggregation operations on the data it has stored.
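
To give you an idea of what those search operations look like, here’s a hedged sketch that queries the log indices directly over Elasticsearch’s HTTP search API. It assumes the fluentd-* index prefix you’ll configure later in this section and the uuid field the services log; the UUID value is just an example:

import json
import requests

# Field names (uuid, @timestamp) assume the Logstash-style entries shown earlier.
query = {
    "query": {"match": {"uuid": "e11f4713-8bd8-4882-b645-55f96d220e44"}},
    "sort": [{"@timestamp": {"order": "asc"}}],
}
resp = requests.get(
    "http://localhost:9200/fluentd-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    source = hit["_source"]
    print(source["@timestamp"], source.get("message"))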

Logstash

Logstash (www.elastic.co/products/logstash) is a server-side processing pipeline that allows data ingestion from multiple sources and transforms that data prior to sending it to Elasticsearch. In your case, you’ll be using the Logstash formatting and data collection capabilities by taking advantage of client libraries. In this chapter’s earlier examples, you already observed its ability to provide consistent data that you can send to Elasticsearch. But here you won’t be using Logstash to send data; you’ll be using Fluentd instead.

Kibana

Kibana (www.elastic.co/products/kibana) is a UI for visualizing Elasticsearch data. It’s a tool you can use to query data and explore its associations. In your use case, it’ll operate on log data. You can use Kibana to derive visualizations from gathered data, so it’s more than a search tool. Figure 12.6 shows an example of a dashboard powered by Kibana.

c12_06.png

Figure 12.6 Kibana tutorial dashboard showing visualizations created from log data

Fluentd

Fluentd (www.fluentd.org) is an open source data collector that you’ll be using to push data from your services to Elasticsearch. You’ll combine the data formatting and collecting capabilities of Logstash and use Fluentd to push that data. One of its advantages is that you can use it as a logging driver for your containers by declaring it in your Docker Compose file.

12.3.2 Setting up your logging solution

You’ll set up your solution via the Docker compose file, like you already did in chapter 11 to create the metrics collecting and alert infrastructure. You can find all the code used in this chapter at Github (http://mng.bz/k191). There, you’ll find the docker-compose.yml file, where you’ll be declaring the new dependencies. The following listing shows the new components added to the compose file.

Listing 12.2 Docker compose file with Elasticsearch, Kibana, and Fluentd containers

version: '2.1'
services:

  gateway:
    container_name: simplebank-gateway
    restart: always
    build: ./gateway
    ports:
      - 5001:5000
    volumes:
      - ./gateway:/usr/src/app
    links:
      - "rabbitmq:simplebank-rabbitmq"
      - "fluentd”
    logging:    ①  
      driver: "fluentd" 
      options:
        fluentd-address: localhost:24224
        tag: simplebank.gateway

(…)
  kibana:
    image: kibana    ②  
    links:
      - "elasticsearch"    ③  
    ports:
      - "5601:5601"

  elasticsearch:
    image: elasticsearch    ④  
    expose:
      - 9200
    ports:
      - "9200:9200"

  fluentd:
    build: ./fluentd    ⑤  
    volumes:
      - ./fluentd/conf:/fluentd/etc    ⑥  
    links:
      - "elasticsearch"    ⑦  
    ports:
      - "24224:24224"
      - "24224:24224/udp"
(…)

Once you’ve added this content to the docker compose file, you’re almost ready to boot your logging infrastructure alongside your services. But first let’s cover the missing tweaks we mentioned previously that’ll allow you to configure Fluentd to your needs. The Dockerfile you use for building Fluentd follows.

Listing 12.3 Fluentd Dockerfile (Fluentd/Dockerfile)

FROM fluent/fluentd:v0.12-debian    ①  
RUN ["gem", "install", "fluent-plugin-elasticsearch", 
➥"--no-rdoc", "--no-ri", "--version", "1.9.2"]    ②  

Now all you need is to create a configuration file for Fluentd. The following listing shows the config file.

Listing 12.4 Fluentd configuration (fluentd/conf/fluent.conf)

<source>    ①  
  @type forward    ②  
  port 24224
  bind 0.0.0.0
</source>
<match *.**>    ③  
  @type copy    ④  
  <store>
    @type elasticsearch    ④  
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    logstash_dateformat %Y%m%d
    include_tag_key true
    type_name access_log
    tag_key @log_name
    flush_interval 1s
  </store>
  <store>
    @type stdout    ④  
  </store>
</match>

The match section in the Fluentd configuration file contains everything needed to connect to Elasticsearch (the host and the port), as well as the output format: you’re using the Logstash format, as you may recall.

With all the needed setup done, you’re now ready to boot your services using the new Docker compose file. Before doing so, let’s go through your services and change code to enable you to send data to your centralized logging infrastructure. In the next section, you’ll configure your services to make use of the Logstash logger. You’ll also set log levels.

12.3.3 Configure what logs to collect

In your services, you can control the log level via environment variables, so you can have different levels for development and production environments. Using different log levels allows you to enable more verbose logs in production, in case you need to investigate any issue.
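
As a plain-Python sketch of the same idea (the services themselves rely on the YAML configuration shown next), reading the level from an environment variable boils down to:

import logging
import os

# Default to INFO, mirroring the ${LOG_LEVEL:INFO} expansion used in the
# service configuration files.
level_name = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(level=getattr(logging, level_name.upper(), logging.INFO))

logging.getLogger(__name__).debug("only emitted when LOG_LEVEL=DEBUG")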

Let’s look at your gateway service for its logging configuration and also at the service code to understand how you can emit log messages. The logging configuration is shown in the following listing.

Listing 12.5 Configuration file for the gateway service (gateway/config.yml)

AMQP_URI: amqp://${RABBIT_USER:guest}:${RABBIT_PASSWORD:guest}@${RABBIT_HOST:localhost}:${RABBIT_PORT:5672}/
WEB_SERVER_ADDRESS: '0.0.0.0:5000'
RPC_EXCHANGE: 'simplebank-rpc'
LOGGING:    ①  
    version: 1
    handlers:
        console:
            class: logging.StreamHandler    ②  
    root:
        level: ${LOG_LEVEL:INFO}    ③  
        handlers: [console]    ④  

This configuration will allow setting the log level when the application boots. In the Docker compose file, you’ve set the environment variable LOG_LEVEL as INFO for all services except for the Gateway, which has a DEBUG value. Let’s now look into the gateway code to set up logging, as shown in the following listing.

Listing 12.6 Enable logging in the gateway service (gateway/app.py)

import datetime
import json
import logging    ①  
import uuid
from logstash_formatter import LogstashFormatterV1    ②  
from nameko.rpc import RpcProxy, rpc
from nameko.web.handlers import http
from statsd import StatsClient
from werkzeug.wrappers import Request, Response

class Gateway:

    name = "gateway"
    orders = RpcProxy("orders_service")
    statsd = StatsClient('statsd', 8125,
                         prefix='simplebank-demo.gateway')
    logger = logging.getLogger()    ③  
    handler = logging.StreamHandler()    ③  
    formatter = LogstashFormatterV1()    ③  
    handler.setFormatter(formatter)    ③  
    logger.addHandler(handler)    ③  

    @http('POST', '/shares/sell')
    @statsd.timer('sell_shares')
    def sell_shares(self, request):
        req_id = uuid.uuid4()
        res = u"{}".format(req_id)

        self.logger.debug(
            "this is a debug message from gateway",
            extra={"uuid": res})    ④  

        self.logger.info("placing sell order",
                         extra={"uuid": res})    ⑤  

        self.__sell_shares(res)

        return Response(json.dumps(
            {"ok": "sell order {} placed".format(req_id)}),
            mimetype='application/json')

    @rpc
    def __sell_shares(self, uuid):
        self.logger.info("contacting orders service",
                         extra={"uuid": uuid})    ⑤  

        res = u"{}".format(uuid)
        return self.orders.sell_shares(res)

    @http('GET', '/health')
    @statsd.timer('health')
    def health(self, _request):
        return json.dumps({'ok': datetime.datetime.utcnow().__str__()})

In listing 12.2, you saw how to enable the Fluentd driver for logging with Docker. This means you’re ready to send log data generated from your services to Elasticsearch using Fluentd, and afterwards you’ll be able to explore that log data using Kibana. To start all the services, metrics, and logging infrastructure from a console in the root directory, execute the following command:

docker-compose up --build --remove-orphans

Once all is ready, you need to complete one last step, which is configuring Kibana to use the logs collected via Fluentd and stored in Elasticsearch. To do so, access the Kibana web dashboard (http://localhost:5601). On the first access, you’ll be redirected to the management page, where you’ll need to configure an index pattern. You need to tell Kibana where to find your data. If you recall, in the Fluentd configuration, you set an option for the logstash prefix with the value fluentd. This is what you need to enter in the index text box presented to you. Figure 12.7 shows the Kibana dashboard management section and the value you need to input.

After inserting fluentd-* as the index pattern and clicking the Create button, you’ll be ready to explore all the log data that your multiple services create. Fluentd forwards all that data to a central location (Elasticsearch), and you’ll be able to access it in a convenient way.

c12_07.png

Figure 12.7 The Kibana management section where you need to indicate the index pattern for it to be able to fetch data from Elasticsearch

To generate some log data, all you need to do is create a sell request to your gateway service. To do so, you need to issue a POST request to the gateway. The following shows the request made via curl, but any tool capable of issuing a POST request will do:

chapter-12$ curl -X POST http://localhost:5001/shares/sell \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json'    ①  
{"ok": "sell order e11f4713-8bd8-4882-b645-55f96d220e44 placed"}    ②  

Now that you have log data collected, you can explore it using Kibana. Clicking on the Discover section on the left side of Kibana’s web dashboard will take you to a page where you can perform searches. In the search box, insert the request UUID you received as the sell order response. In the case of this example, you’d be using e11f4713-8bd8-4882-b645-55f96d220e44 as your search parameter. In the next section, we’ll show you how to use Kibana to follow the execution of a sell request through the different services involved.

12.3.4 Finding a needle in the haystack

In the code example for this chapter, you have five independent services collaborating to allow SimpleBank users to sell shares. All services are logging their operations and using the request UUID as a unique identifier to allow you to aggregate all log messages referring to a sell order that’s processing. You can make use of Kibana to explore your logs and track down the execution of a request. In figure 12.8, you use the order ID that the gateway service returns to perform the search.

When you use the request ID as the search parameter, Kibana filters the log data, and you get 11 hits that allow you to follow the execution of a request through different services. Kibana also lets you build more complex queries to uncover further insights. You can filter per service, sort by time, and even use log data, such as the execution times present in the log entries, to create dashboards to track performance. This use is beyond the scope of this chapter, but do feel free to explore the possibilities to get new perspectives on the collected data.

We’ll now zoom in a bit and focus on some of the log entries that your query shows. Figure 12.9 shows some of those entries with a bit more detail.

c12_08.png

Figure 12.8 Searching the log data using a request ID

c12_09.png

Figure 12.9 Detailed view of log messages in the Kibana search page

In figure 12.9, you’ll find messages from the market and gateway services. For the latter, given that the log level selected was DEBUG, you’ll find both info and debug messages. As mentioned previously, by using the logstash-formatter library in your Python code, you get more information for free. You can find data regarding execution times and scoping of the execution by module, function, line, process, and thread. All of this info can be useful if you need to diagnose any issues in the future.

12.3.5 Logging the right information

Now that you’re able to collect logs and store them, you need to be careful about what information you send via logs. Things like passwords, credit card numbers, ID card numbers, and potentially sensitive personal data that get sent will be stored and accessible by anyone who can use the logging infrastructure. In your case, you’re hosting and controlling the logging infrastructure, but if you were using a third-party provider, you’d need to pay extra attention to these details. You have no easy way to delete only some of the data already sent. In most cases, if you want to delete something specific, it means deleting all the log data for a given period of time.

Data privacy is a hot topic at the moment, and with the EU General Data Protection Regulation (GDPR) (www.eugdpr.org) now in effect, you need to take extra care when considering which data to log and how to log it. We won’t explore the needed steps in depth here, but both Fluentd and Elasticsearch allow you to apply filtering to data so that any sensitive fields get masked, encrypted, or removed from the data that they receive and that Elasticsearch indexes. The general rule would be to log as little information as you can, avoid any personal data in logs, and take extra care with reviewing what gets logged before any changes make it into the production environment. Once your services send data, it’s hard to erase it and doing so will have associated costs.
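
Although in this chapter that kind of filtering would live in Fluentd or Elasticsearch, you can also mask data before it ever leaves a service. The following is only a sketch: the field names are illustrative, and the dict-style messages are an assumption based on the logging style used earlier in this chapter:

import logging

SENSITIVE_KEYS = {"password", "card_number", "id_card_number"}  # illustrative

class MaskSensitiveData(logging.Filter):
    """Replace sensitive values before the record reaches any handler."""
    def filter(self, record):
        if isinstance(record.msg, dict):
            record.msg = {
                key: ("***" if key in SENSITIVE_KEYS else value)
                for key, value in record.msg.items()
            }
        return True

logger = logging.getLogger()
logger.addFilter(MaskSensitiveData())
logger.warning({"message": "card validation failed",
                "card_number": "4111111111111111"})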

That said, you can and should use logs to communicate useful information to allow you to understand system behavior. Sending IDs that allow you to correlate actions of different systems and terse log messages indicating actions performed in or by systems can help you keep track of what’s happened.

12.4 Tracing interactions between services

When you were setting up your logging infrastructure and making the code changes to emit log messages, you already took care of propagating an ID field that allows you to follow the execution path of a request through your system. With this in place, you can group log entries under the same context. You may even use log data to create visualizations that’ll allow you to understand how much time each component took to process a request. In addition, you can use it to help you identify bottlenecks and places where you can improve your code for extra performance. But logs aren’t the only tool available; you have another method at your disposal that doesn’t rely on log data.

You can do better by reconstructing the journey of requests through your microservices. In this section, you’ll be setting up distributed tracing to allow you to visualize the flow of execution between services and at the same time, provide insights on how long each operation takes. This can be valuable, not only to understand the order in which a request flows through multiple services, but also to identify possible bottlenecks. For this, you’ll use Jaeger and libraries compatible with the OpenTracing API (http://opentracing.io).

12.4.1 Correlating requests: traces and spans

A trace is a directed acyclic graph (DAG) of one or more spans, where the edges between those spans are called references. Traces aggregate and correlate the execution flow through the system; to build them, some information needs to be propagated between the parties involved. A trace captures the whole flow.

Let’s look at figures 12.10 and 12.11. In these figures, we represent a trace made up of multiple spans, from both a dependency perspective and a temporal perspective.

c12_10.png

Figure 12.10 A trace made up of eight different spans from a dependency perspective

c12_11.png

Figure 12.11 The temporal relationships in an eight-span trace

In figure 12.10, you can observe the dependency relationship between different spans. These spans can be triggered either in the same application or in different ones. The only requirement is for the parent span ID to be propagated, so when a new span is triggered, it’ll hold a reference to its parent span.

In figure 12.11, you have a view of spans from a temporal perspective. By using temporal information contained in spans, you can organize them in a timeline. You can see not only when each span happened relative to other spans but also how long each operation that a span encapsulates took to complete.

Each span contains the following information:

  • An operation name
  • A start and a finish timestamp
  • Zero or more span tags (key-value pairs)
  • Zero or more span logs (key-value pairs with a timestamp)
  • A span context
  • References to zero or more spans (via the span context)

The span context contains the needed data to refer to different spans, either locally or across service boundaries.
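
To make those terms concrete, here’s a minimal sketch of a parent span and a child span using the OpenTracing API. It assumes a tracer like the one you’ll initialize in the next section, and the operation names are made up:

import time

def sell_shares(tracer, uuid):
    with tracer.start_span('sell_shares') as parent:           # root span of the trace
        parent.set_tag('uuid', uuid)                            # span tag (key-value pair)
        with tracer.start_span('place_order',
                               child_of=parent) as child:       # reference to the parent
            child.log_kv({'event': 'order placed'})             # span log (timestamped)
            time.sleep(0.1)                                     # work measured by the child span
    # start and finish timestamps are recorded automatically when each
    # `with` block is entered and exited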

Let’s now move on to setting up tracing between services. You’ll be using Jaeger (www.jaegertracing.io), a distributed tracing system, as well as a set of Python libraries that are OpenTracing compatible.

12.4.2 Setting up tracing in your services

To be able to display tracing information and correlate requests between different services, you’ll need to set up a collector and a UI for traces, as well as include some libraries in your services and configure them. The services we’ll use as an example for distributed tracing are the SimpleBank profile and settings services. In figure 12.12, we give an overview of the interactions you’ll be tracing.

c12_12.png

Figure 12.12 Interactions of the profile service

The profile service will contact an external service, in this case jsontest.com, to retrieve its IP and will also be fetching user settings from the settings service. You’ll be setting up a tracing system (Jaeger) and making the code changes needed to display the trace and its spans and to be able to correlate those spans. Correlating them will allow you to understand in detail how long each operation took and how it contributed to the overall execution time of a call to the profile service. You’ll begin by setting up Jaeger, the distributed tracing system collector and UI, by adding a Docker image into your docker-compose.yml file (listing 12.7).

Listing 12.7 Adding Jaeger to the docker-compose.yml file

(…)
jaeger:
    container_name: jaeger
    image: jaegertracing/all-in-one:latest    ①  
    ports:
       - 5775:5775/udp    ②  
       - 6831:6831/udp
       - 6832:6832/udp
       - 5778:5778
       - 16686:16686    ③  
       - 14268:14268
       - 9411:9411    ④  
    environment:
      COLLECTOR_ZIPKIN_HTTP_PORT: "9411"    ④  

With the Docker image added to your docker-compose file, once you boot all the SimpleBank infrastructure, you’ll have a distributed tracing system in place. Now you need to make sure that the SimpleBank profile and settings services are able to create traces and spans and communicate them to Jaeger.

Let’s add the needed libraries to both the settings and profile services and initialize the tracer. The following listing adds the tracing libraries.

Listing 12.8 Adding the tracing libraries to the services via a requirements.txt file

Flask==0.12.0
requests==2.18.4
jaeger-client==3.7.1    ①  
opentracing>=1.2,<2    ②  
opentracing_instrumentation>=2.2,<3    ③  

By adding these libraries, you’re now able to create traces and spans from both services. To make the process easier, you can also create a module to provide a convenient setup function to initialize the tracer, as shown in the following listing.

Listing 12.9 Tracer initializer lib/tracing.py

import logging
from jaeger_client import Config    ①  
def init_tracer(service):    ②  
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {    ③  
                'reporting_host': "jaeger",
                'reporting_port': 5775,
            },
            'logging': True,    ④  
            'reporter_batch_size': 1,
        },
        service_name=service,    ⑤  
    )
    return config.initialize_tracer()

The SimpleBank profile and settings services will both be using the tracer initialization function shown in listing 12.9. This allows them to establish the connection to Jaeger. As we showed in figure 12.12, the profile service contacts both an external service and the settings internal service. You’ll be tracing the interaction of the profile service with both of these collaborators. In the case of the interaction with the SimpleBank settings service, you’ll need to pass along the context of the initial trace so you can visualize the full cycle of a request.

Listing 12.10 shows the profile service code where you set up the spans for both the external http service and the settings service. For the former, you create a span, and for the latter, you pass along the current span as a header so the settings service can make use of it and create child spans.

Listing 12.10 Profile service code

from urlparse import urljoin

import opentracing
import requests
from flask import Flask, jsonify, request
from opentracing.ext import tags    ①  
from opentracing.propagation import Format    ①  
from opentracing_instrumentation.request_context import get_current_span    ①  
from opentracing_instrumentation.request_context import span_in_context    ①  
from lib.tracing import init_tracer    ②  

app = Flask(__name__)
tracer = init_tracer('simplebank-profile')    ③  

@app.route('/profile/<uuid:uuid>')
def profile(uuid):
    with tracer.start_span('settings') as span:    ④  
        span.set_tag('uuid', uuid)
        with span_in_context(span):
            ip = get_ip(uuid)
            settings = get_user_settings(uuid)
            return jsonify({'ip': ip, 'settings': settings})

def get_ip(uuid):
    with tracer.start_span('get_ip', child_of=get_current_span()) as span:    ⑤  
        span.set_tag('uuid', uuid) 
        with span_in_context(span):    ⑥  
            jsontest_url = "http://ip.jsontest.com/"
            r = requests.get(jsontest_url)
            return r.json()

def get_user_settings(uuid):
    settings_url = urljoin("http://settings:5000/settings/", "{}".format(uuid))
    span = get_current_span()    ⑦  
    span.set_tag(tags.HTTP_METHOD, 'GET')    ⑧  
    span.set_tag(tags.HTTP_URL, settings_url)    ⑧  
    span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)    ⑧  
    span.set_tag('uuid', uuid)    ⑧  
    headers = {}
    tracer.inject(span, Format.HTTP_HEADERS, headers)    ⑧  

    r = requests.get(settings_url, headers=headers)
    return r.json()

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

The SimpleBank profile service initializes a trace that’ll be used to group different spans. It creates spans for the calls to "http://ip.jsontest.com/" and for the call to the SimpleBank settings service. For the former, given that you don’t own the service, you execute the call wrapped in a span. But for the latter, because you control it, you can pass on span information that’ll be used to create child spans. This will allow you to group all related calls in Jaeger.

Let’s now look into how you can make use of the injected span in the SimpleBank settings service, as shown in the following listing.

Listing 12.11 Using a parent span in the settings service

import time
from random import randint

import requests
from flask import Flask, jsonify, request
from opentracing.ext import tags
from opentracing.propagation import Format
from opentracing_instrumentation.request_context import get_current_span
from opentracing_instrumentation.request_context import span_in_context

from lib.tracing import init_tracer

app = Flask(__name__)
tracer = init_tracer('simplebank-settings')    ①  

@app.route('/settings/<uuid:uuid>')
def settings(uuid):
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)    ②  
    span_tags = {tags.SPAN_KIND: tags.SPAN_KIND_RPC_SERVER, 'uuid': uuid}    ③  
    with tracer.start_span('settings', child_of=span_ctx, tags=span_tags):    ④  
        time.sleep(randint(0, 2))
        return jsonify({'settings': {'name': 'demo user', 'uuid': uuid}})

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

By extracting the span context from the request the settings service receives, you can then make use of it as the parent to new spans. You can later visualize this new child span independently. But you’ll also be able to take advantage of the fact that Jaeger will show the span as both an independent span in the context of the SimpleBank settings service and a child span in the context of the SimpleBank profile service.

12.5 Visualizing traces

With all the setup out of the way, all you need to do to start collecting traces is to issue a request to the SimpleBank profile endpoint. You can use the command line or a browser. To access traces via the command line, you can use curl to issue the following request:

$ curl http://localhost:5007/profile/26bc34c2-5959-4679-9d4d-491be0f3c0c0
{
  "ip": {
    "ip": "178.166.53.17"
  },
  "settings": {
    "settings": {
      "name": "demo user",
      "uuid": "26bc34c2-5959-4679-9d4d-491be0f3c0c0"
    }
  }
}

Here’s a brief recap of what’s going on when you hit the profile endpoint:

  • The profile service creates a span A.
  • The profile service contacts an external service to fetch the IP, wrapping it under a new span B.
  • The profile service contacts the internal SimpleBank settings service to get user info under a new span C and passes the context of the parent span to the downstream service.
  • Both services communicate spans to Jaeger.

c12_13.png

Figure 12.13 Jaeger UI search page showing the services that have traces available

To visualize the traces, you need to access the Jaeger UI that’ll be running on port 16686. Figure 12.13 shows the Jaeger UI and the list of services that have traces available.

In the Service section, you see three services for which trace information is available: two SimpleBank services and one called jaeger-query. The latter gathers Jaeger internal traces and is of little use to you. You’re interested in the other two services listed: simplebank-profile and simplebank-settings. If you recall, the profile service was creating spans for the execution of an external call, as well as for the call to the settings service. Go ahead and select simplebank-profile and click Find Traces at the bottom. Figure 12.14 shows the traces for the profile service.

The page lists six traces, and all of them have three spans across two services. This means you were able to collect information about the collaboration between two internal services and to get timing information about the execution. Figure 12.15 shows a detailed view of one of those traces.

c12_14.png

Figure 12.14 The simplebank-profile traces information

c12_15.png

Figure 12.15 The timing information and execution sequence of a call to the profile service

In figure 12.15, you can see a timeline for the different steps of execution in a call to the profile service. You have information about the overall time of execution, as well as when each suboperation took place and for how long. The spans contain information about the operation, the component that generated them, and their execution times and relative positions, both in the timeline and in regards to dependencies with parent spans.

This information can be invaluable for knowing what’s going on in a distributed system. You can now visualize the flow of requests through different services and know how long each operation takes to complete. This simple setup allows you both to understand the flow of execution in a microservice architecture and to identify potential bottlenecks that you can improve.
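
If you ever want to pull this data programmatically rather than through the UI, the query service behind port 16686 also answers HTTP requests. Treat the following as an assumption-laden sketch: the endpoint and response fields are based on the calls the Jaeger UI itself makes, not on a documented, stable API:

import requests

resp = requests.get("http://localhost:16686/api/traces",
                    params={"service": "simplebank-profile", "limit": 20})
for trace in resp.json().get("data", []):
    spans = trace["spans"]
    # Rough total duration: latest span end minus earliest span start,
    # with times reported in microseconds.
    start = min(s["startTime"] for s in spans)
    end = max(s["startTime"] + s["duration"] for s in spans)
    print(trace["traceID"], len(spans), "spans,", (end - start) / 1000.0, "ms")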

You also can use Jaeger to understand how different components in your system relate to each other. The top navigation menu bar has a Dependencies link. By clicking it and then, in the page that comes up, selecting the DAG (directed acyclic graph) tab, you have access to the view illustrated in figure 12.16.

The example we used was a simple one, but it allows you to understand the power of tracing in a microservice architecture. Along with logging and metrics, it allows you to have an informed view of both the performance and the behavior of your system.

c12_16.png

Figure 12.16 The service dependency view in the Jaeger UI

Summary

  • You can set up a logging infrastructure using Elasticsearch, Kibana, and Fluentd, and distributed tracing using Jaeger.
  • A logging infrastructure can generate, forward, and store indexed log data that allows searching and correlating requests.
  • Distributed tracing allows you to follow the journey of execution of requests through different microservices.
  • Alongside metrics collection, tracing allows you to better understand how the system is behaving, identify potential issues, and audit your system anytime.