The Observable System

There are two key technical aspects to building a production-worthy system:

  • We must build a system that scales sufficiently, and that can handle the business requirements. In most of this lesson, we focus on this aspect. There are patterns that support safely introducing distribution into a system, which in turn supports easier distribution of work. We largely wave away discussions of capacity planning, because our main goal is to avoid introducing single points of failure, not to handle Google scale. A sufficiently decomposed system may still falter at much greater scale, but you can address those issues when you come to them.

    As you meet these requirements, be clear about what you mean by scalability. There are many ways to characterize scalability in a system, but unless you know where your system is, you won’t be able to get it where you want it to go. One very useful model is defined by the Universal Scalability Law, which Baron Schwartz describes in his book Practical Scalability Analysis with the Universal Scalability Law. Coda Hale created a useful modeling library, usl4j, that—given observations (measurements) of any two dimensions of Little’s law—gives you a model for predicting any dimension from any other. Such a model supports scaling to meet particular requirements. Here, a cloud platform like Cloud Foundry can make life easier, as it supports easy horizontal scaling. We’ll look at how to measure a system in this lesson; these measurements feed into scalability models.

  • We must build a system that does the right thing when things don’t work as expected. An application must support ease of remediation.

    Both of these requirements demand that we have visibility into the system—that we be able to measure it. Beyond the technical requirements, visibility supports the business as well. If we’re to truly iterate in an agile fashion, then software must be shippable and releasable, if not released, to the customer after every iteration. The “customer” may be a client, a nonprofit, an open source project, or yourself—whoever draws value from the software. The sooner software is released to the customer, the sooner it delivers value. Released, working software derisks continued development by capturing business value.
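The Universal Scalability Law mentioned above is simple enough to evaluate directly. Here is a minimal sketch of the formula itself—predicted throughput X(N) at concurrency N, given a contention coefficient σ, a crosstalk coefficient κ, and unloaded throughput λ. The coefficient values below are illustrative only; in practice you fit them to real measurements, which is what a library like usl4j does for you.

```java
// Sketch of the Universal Scalability Law:
//   X(N) = (lambda * N) / (1 + sigma*(N - 1) + kappa*N*(N - 1))
// sigma models contention, kappa models coherency (crosstalk) costs.
// The coefficients here are made up for illustration, not measured.
public class UslSketch {

    static double throughput(double lambda, double sigma, double kappa, double n) {
        return (lambda * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    public static void main(String[] args) {
        double lambda = 1000, sigma = 0.05, kappa = 0.0001;
        for (int n = 1; n <= 64; n *= 2) {
            // Throughput climbs sublinearly and eventually retrogrades.
            System.out.printf("N=%d -> %.0f req/s%n", n, throughput(lambda, sigma, kappa, n));
        }
    }
}
```

Plotting the output shows the characteristic USL curve: near-linear gains at low concurrency, diminishing returns as contention dominates, then retrograde throughput once crosstalk costs take over.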

How do we know if it’s working? Software is silent. It runs just as quietly when it’s working as when it’s dead. There are no telltale sounds or smells that signal its malfunction. When we build software, we need to build in a definition of working. This helps us set expectations for normal operating behavior—a baseline. A baseline serves as a basis for measured improvements in behavior.

In this lesson, we’ll look at the nonfunctional or cross-functional capabilities that all applications need if we’re to have a hope of operationalizing them. The majority of this lesson relates to observability—a measure for how well internal states of a system can be inferred by knowledge of its external outputs.

The rub is that these capabilities are not business differentiators; they’re not what your organization went into business to address! Yet they’re critical to the continued and safe operation and evolution of an application. Michael Nygard, in his epic tome Release It! (Pragmatic), catalogs these sorts of capabilities and concerns in dizzying detail. The punchline is that code-complete is not the same as production-ready. Software cannot be released if it is not production-ready. Done must mean done.

You Build It, You Run It

The developer’s instinct historically has been to ignore these sorts of nonfunctional requirements. After all, if they don’t contribute to the feature set and the customers aren’t charmed by their presence, why focus on them at all? Certainly, they shouldn’t be a subject of concern at project inception! Developers didn’t want to be bothered with code changes to support these requirements when they had business-differentiating functionality to add and an ever-growing backlog. It wasn’t their problem if the software failed in the dead of night; the help desk would handle it! But no help desk or operations team wants to be saddled with a black-box piece of software (into which they have absolutely no visibility) that’s malfunctioning at 4 a.m., so operations cared deeply about observability and did what they could, short of changing the code itself, to support it. The result was that developers would write software and chuck it over an imaginary wall to operations, and operations would instrument what they could without changing the code.

If developers and operations ever did become aware of each other, it was usually as a point of contention: developers perceived a loss of autonomy in the way the application had been operationalized. Developers wanted to release code; operations wanted to ensure the stability of the production system. Such a dynamic is not ideal. It results in a disconnect between the business outcome and the stability of the systems. Today, of course, we talk about DevOps: the idea that both operations and developers are charged with ensuring the stability of the system and with business outcomes. They are two sides of the same coin, working toward shared goals. Many high-performing organizations have adopted a simple mantra to close the gap between developers and operations: “you build it, you run it.” Teams are now charged with the maintenance of their code in production.

You will be a lot more interested in building robust systems with no failure points that support observability and remediation if you know you might be awakened at 4 a.m. to support them!

Murder Mystery Microservices

The need to support observability becomes even more critical in the cloud, when building distributed systems. With microservices, every outage is more like a murder mystery. Increment Magazine (a fascinating new foray into the world of on-call operations) looked at “What happens when the pager goes off?”. They describe the fairly standardized, high-level framework that a large swath of high-performing organizations across the industry follow in response to an incident. Broadly, here are the basic steps:

Triage

Somewhere in the system, something is wrong. In this stage, the job is to identify where it might be malfunctioning, assess the impact and severity of the incident, and classify it.

Coordinate

In this phase, work needs to begin toward mitigation. This work may be done by another team (the developers) or the same team that triaged the incident. Typically, groups of people communicate through established channels (chat, Slack, Skype, Google Hangouts, etc.). Work underway must be documented and shared. Many organizations create war rooms and set up conference calls.

Mitigate

At this stage, the goal is to reduce the impact of the incident and restore system stability, not to fix the root cause. For example, if a system has failed because of a deployment, the mitigation step would be to roll back the change, not to try to fix the underlying problem.

Resolve

Mitigation stops the incident in its tracks, preventing further impact, but it does not fix the root cause, and existing users may still be affected. In the Resolve stage, developers perform root-cause analysis and resolution. Time is of the essence here, but teams must be careful to observe normal quality-guarding practices, like testing any hotfixes. Many organizations measure both mean time to mitigation (MTTM) and mean time to resolution (MTTR).

Follow-up

In this stage, an organization attempts to internalize lessons from the incident by doing blameless postmortems, making and assigning follow-up tasks, and holding incident review meetings. The incident is considered over only once all the follow-up tasks are done.

At every step, it is incredibly important to have support in the organization for communication between people and for visibility into the system itself.

Twelve-Factor Operations

An operationalized application is one that is built for production. Such an application will, as we’ll see in this lesson, work with a diverse ecosystem of tools (centralized log processing, health and process managers, job schedulers, distributed tracing systems, and so on) beyond the business-differentiating functionality that is the essence of the application.

We will, by building our application with key tenets from the twelve-factor manifesto in mind, and by working with the conventions of Spring Boot and Spring Cloud, benefit from this infrastructure if it is available. Make no mistake, however, that something needs to supply that infrastructure. Deploying into production blind, with no supporting infrastructure, is not an option. If the twelve-factor manifesto describes a set of good, clean, cloud-hygiene principles for building production-ready applications, then something needs to satisfy the other side of the contract and support what Andrew Clay Shafer refers to as twelve-factor operations. Something needs to run apps and services as stateless processes, provide a well-known application life cycle, make it easy to externalize configuration, support log management, provide backing services, make it easy to scale applications horizontally, provide declarative port binding, etc. Cloud Foundry, naturally, does a very good job supporting the operational requirements.

In this lesson we’ll look at how to surface node-by-node information and how to centralize that information to support the single-pane-of-glass experience required for quick comprehension of a system’s behavior. It is critical that we capture the behavior of the system, not just the applications in the system.

The map is not the terrain. Just as looking at a map of Manhattan has far less fidelity than actually walking through Manhattan, a system has emergent behavior that cannot be captured in an architecture diagram of the system. It can only be captured through effective, systemwide monitoring.

The New Deal

The requirements to successfully deploy applications into production have not changed drastically. What has changed is how divorced developers can afford to be from operational concerns, if the organization is to prosper, and how apathetic operations can be to application requirements that may risk system stability. The handoff between developers and operations used to be an opaque deliverable, something like a Servlet container-compatible .war. The application deployed in this black box benefited from some container-provided services, like a built-in start and stop script and a central log spout. Operators needed to further customize such a container in order for any of those container promises to be meaningful. Operators would then need to ensure a whole world of supporting infrastructure was in place so that the application would enjoy stability in production:

Process scheduling and management
What component will start the application and gracefully shut it down? How do we ensure that it isn’t running twice on the same host?
Application health and remediation
How do operators know if the application is running well? What happens if the application process dies? What happens if the host itself dies?
Log management
How do operators see the logs spooling from application instances? How are collection and analysis handled?
Application visibility and transparency
How do operators capture application state, or quantify application state as metrics, and analyze and visualize them?
Route management
Is the application exposed to the internet? Load-balanced? Do the routes update correctly when the application is restarted on another host?
Distributed tracing
Who is accessing the system? Which services are involved in processing a transaction, and what is the latency of those requests? What does the average call graph look like?
Application performance management
How do operators diagnose and address complex application performance problems?

“A Bad System Will Beat a Good Person Every Time”, by W. Edwards Deming

Deming said that in the context of people, not distributed systems or the platforms that run them, but we think it applies equally there. If the system we use to manage our software makes doing the right thing difficult, then people won’t do it.

It is critical that supporting observability and these best practices be as friction-free as possible so that it is a no-brainer to introduce it consistently across services and projects. Organizations large and small know the dreaded corporate Wiki page (“The 500 Easy Steps to Production”), full of boilerplate and manual work to be done before a service may be deployed. Here, Cloud Foundry and Spring Boot stand strong. Cloud Foundry provides support for the twelve-factor operations requirements, and Spring Boot (and its auto-configuration) codifies the application and customization of the principles of the twelve-factor application. They reduce the cognitive overhead of doing the right thing to almost nil. We need only get it right once, and reuse from there. Undifferentiated heavy lifting is the enemy of velocity.

This infrastructure is hard to get right and expensive to develop. It is also, very importantly, not business-differentiating functionality. We believe that opinionated platforms, like Heroku or distributions of the open source Cloud Foundry, provide the best mix of supporting infrastructure and ease of use. These platforms make certain assumptions about the applications—that they are transactional, online web applications written in compliance with the tenets of the twelve-factor manifesto. These assumptions make it easy for the platform to meet certain requirements; they reduce the surface area for variability in a codebase. These assumptions support automation and increase velocity.

Observability

Observability is a tricky thing. If you have a monolithic application, things are in some ways easier. If something goes wrong with the system, there can be no question, at least, of where the error occurred: the error is coming from inside the building! Things become markedly more complicated in a distributed system, where interactions between components make failure isolation critical. You wouldn’t drive a single-passenger car without instrumentation, gauges, and windows supporting visibility; how could you hope to operate air traffic control for hundreds of airplanes without it?

Improved visibility supports the business’s continued investment in a system and its changing priorities. Operations uses good visibility to connect eyes to potential system problems (alerting) and to aid in incident response. In some cases, good observability can make the response to incidents automatic, or at least push-button simple.

Ideally, operational and business visibility can be correlated and used to drive a single-pane-of-glass experience—a dashboard. In this lesson, we’ll look at how to collect and understand historical and present status and how to support forward-looking predictions. Predictions, in particular, are driven by models based on historical data.

Historical data is data stored somewhere for a period of time. Historical data may drive long-term business insight, where fresher, more recent data may support operational telemetry, error analysis, and debugging.

Your system is a complex machine with lots of moving parts. It is difficult to know which information to collect and which to ignore, so err on the side of caution and collect as much as possible.

Push Versus Pull Observability and Resolution

Some monitoring and observability tools take a pull-based approach, where centralized infrastructure pulls data from services at an interval; other monitoring infrastructure expects each node to push status events to it. Many of the tools that we’ll look at in this lesson can work in one fashion or the other—or sometimes both. It’s up to you to decide which approach you’d like.

For a lot of organizations, the discussion is one of resolution. How often do you update monitoring infrastructure? In a dynamic environment, things may come and go as they need to. Indeed, the life span of a service might be only seconds or minutes when we talk about ad hoc tasks. If a system employs pull-based monitoring, the interval between pulls may be longer than the entire span of a running application! The monitoring infrastructure is effectively blind to entire running components, and could possibly miss out on major peaks and valleys in the data. This is one strong reason to embrace push-based monitoring for these kinds of components.

Here, we benefit considerably from Spring’s flexibility: it often provides events that we can use to trigger monitoring events. As you read this lesson, ask yourself whether a given approach is pull- or push-based, and ask how you could conceivably turn something pull- to push-based if needed.
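As a sketch of the push side, here is a minimal in-process reporter (the names and API here are hypothetical, not Spring’s or any real library’s) that accumulates counters locally, flushes deltas to a sink on a fixed schedule, and flushes once more on close—so even a task whose whole life span is shorter than a pull interval still gets its metrics out:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Hypothetical push-based metrics reporter. Counters accumulate in memory
// and their deltas are pushed to a sink (e.g., a StatsD or HTTP client)
// on an interval, and once more at shutdown so nothing is lost.
public class PushReporter implements AutoCloseable {

    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final BiConsumer<String, Long> sink;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public PushReporter(BiConsumer<String, Long> sink, long periodMillis) {
        this.sink = sink;
        this.scheduler.scheduleAtFixedRate(this::flush, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    public void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    void flush() {
        counters.forEach((name, adder) -> {
            long delta = adder.sumThenReset(); // read-and-reset the accumulated delta
            if (delta > 0) sink.accept(name, delta);
        });
    }

    @Override
    public void close() { // final flush: short-lived tasks still report
        scheduler.shutdown();
        flush();
    }
}
```

A pull-based design would instead expose the counters over an endpoint and let the monitoring infrastructure scrape them; the final flush in close() is exactly the piece a pull model cannot replicate for a process that dies between scrapes.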

Capturing an Application’s Present Status with Spring Boot Actuator

The present status of an application is the kind of information you would project onto a dashboard, perhaps then visualized on a giant screen in the office where people can see it. You may or may not keep all of this information for later use. Present-state status is like the speedometer in a car: it should tell you as concisely and quickly as possible whether there’s trouble or not.

If you had to distill your system’s state into a visualization (red to indicate danger, green to indicate that everything is all right, or yellow to indicate that something is perhaps amiss but within tolerable ranges), what information would you choose? That’s present-state information.

Present status might include information like memory, thread pools, database connections, and total requests processed. It might include statistics like requests per second (for all parts of the system that handle requests, including HTTP endpoints and message queues), 95th-percentile response times, errors encountered, and the state of circuit breakers.

The Spring Boot Actuator framework provides out-of-the-box support for surfacing information about the application through endpoints. Endpoints collect information and sometimes interact with other subsystems. These endpoints may be viewed a number of different ways (using REST or JMX, for example). We’ll focus on REST endpoints. Endpoints are pluggable, and various Spring Boot-based subsystems often contribute additional endpoints where appropriate. To use Spring Boot Actuator, add org.springframework.boot : spring-boot-starter-actuator to your project’s build. Add org.springframework.boot : spring-boot-starter-web to have the custom endpoints exposed as REST endpoints.
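In a Maven build, for example, those two starters look like this (versions are managed by the Spring Boot parent POM or BOM, so none are declared here):

```xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>
```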

Tip

From the Spring Boot documentation: “an actuator is a manufacturing term, referring to a mechanical device for moving or controlling something. Actuators can generate a large amount of motion from a small change.”

Table 1-1 lists a few of the Actuator endpoints.

Table 1-1. A few of the Actuator endpoints
Endpoint Usage

/info

Exposes information about the current service.

/metrics

Exposes quantifiable values about the service.

/beans

Exposes a graph of all the objects that Spring Boot has created for you.

/configprops

Exposes information about all the properties available to configure the current Spring Boot application.

/mappings

Exposes all the HTTP endpoints that Spring Boot is aware of in this application as well as any other metadata (such as specified content-types or HTTP verbs in the Spring MVC mapping).

/health

A description of the state of components in the system: UP, DOWN, etc. Also returns HTTP status codes.

/loggers

Shows and modifies the loggers in the application.

/auditevents

Shows all the AuditEvent instances that have been recorded by the AuditEventRepository. These are events that connect authenticated Principal entities to events in the system. You can capture and emit custom events, as well.

/cloudfoundryapplication

Enables a Cloud Foundry-based management UI to be augmented with Spring Boot Actuator information. An application status page might then include the full Spring Boot /health output, in addition to “running” or “stopped.” This information is secured and requires a valid token from Cloud Foundry’s UAA authentication and authorization service. If your application is not running on Cloud Foundry, you can disable this endpoint with management.cloudfoundry.enabled=false.

/env

Returns all of the known environment properties, such as those in the operating system’s environment variables or the results of System.getProperties().
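To get a sense of the shape of these responses, the /health endpoint’s output looks something like the following in a Spring Boot 1.x application. The values here are representative only, and the exact contributors (disk space, database, and so on) depend on what’s on the classpath:

```json
{
  "status" : "UP",
  "diskSpace" : {
    "status" : "UP",
    "total" : 511243735040,
    "free" : 328838250496,
    "threshold" : 10485760
  },
  "db" : {
    "status" : "UP",
    "database" : "H2",
    "hello" : 1
  }
}
```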

Let’s look at some of these endpoints in a bit more depth.

Metrics

Everybody generalizes from too few data points. At least I do.

Parand Darugar

Metrics are numbers. In the Spring Boot Actuator framework, there are three kinds of metrics: public metrics (which we’ll look at shortly), gauges, and counters. A gauge records a single value, all at once, and requires no tabulation. A counter records a delta (an increment or decrement); its value is reached over time, through tabulation. Because metrics are numbers, they are easy to store, graph, and query. Some metrics matter to operations, and operations alone: host-specific information like RAM use, disk space, and requests per second. Everybody in the organization will care about semantic metrics: how many orders were placed in the last hour, how many new account sign-ups have occurred, which products were sold and in what quantities, etc. By default, Spring Boot exposes these metrics at /metrics (Example 1-1).

Example 1-1. Metrics from an application’s /metrics endpoint
{
   "classes" : 9731,
   "heap.committed" : 570368,
   "nonheap.used" : 72430,
   "systemload.average" : 3.328125,
   "gauge.response.customers.id" : 7,
   "gc.ps_marksweep.count" : 2,
   "nonheap" : 0,
   "counter.status.200.customers" : 1, 1
   "counter.status.200.customers.id" : 2,
   "mem.free" : 390762,
   "heap.used" : 179605,
   "classes.unloaded" : 0,
   "gauge.response.star-star.favicon.ico" : 4,
   "instance.uptime" : 47231,
   "counter.status.200.star-star.favicon.ico" : 2,
   "threads.peak" : 21,
   "nonheap.init" : 2496,
   "threads.totalStarted" : 27,
   "mem" : 642797,
   "httpsessions.max" : -1,
   "counter.customers.read.found" : 2,
   "gc.ps_marksweep.time" : 96,
   "uptime" : 52379,
   "threads" : 21,
   "customers.count" : 6,
   "gc.ps_scavenge.count" : 6,
   "heap.init" : 262144,
   "httpsessions.active" : 0,
   "nonheap.committed" : 74112,
   "gc.ps_scavenge.time" : 87,
   "counter.status.200.admin.metrics" : 2,
   "datasource.primary.usage" : 0,
   "processors" : 8, 2
   "gauge.response.customers" : 9,
   "heap" : 3728384,
   "gauge.response.admin.metrics" : 4,
   "threads.daemon" : 19,
   "datasource.primary.active" : 0,
   "classes.loaded" : 9731
}
1

The metrics include counts of requests, their paths, and their HTTP status codes already. Here, 200 is the HTTP status code.

2

The Spring Boot Actuator also captures other salient information about the system, like how many processors are available.

The metrics already record a lot of useful information for us: they record all requests made (and the corresponding HTTP status code), and information about the environment (like the JVM’s threads, loaded classes, and information about any configured DataSource instances). Spring Boot conditionally registers metrics based on the subsystems in play.

You can contribute your own metrics using the org.springframework.boot.actuate.metrics.CounterService to record deltas (“one more request has been made”) or the org.springframework.boot.actuate.metrics.GaugeService to capture absolute values (“there are 140 users connected to the chat room”). See Example 1-2.

Example 1-2. Collecting customer metrics with the CounterService
package demo.metrics;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.metrics.CounterService;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.servlet.support.ServletUriComponentsBuilder;

import java.net.URI;

@RestController
@RequestMapping("/customers")
public class CustomerRestController {

 private final CounterService counterService; 1

 private final CustomerRepository customerRepository;

 @Autowired
 CustomerRestController(CustomerRepository repository,
  CounterService counterService) {
  this.customerRepository = repository;
  this.counterService = counterService;
 }

 @RequestMapping(method = RequestMethod.GET, value = "/{id}")
 ResponseEntity<?> get(@PathVariable Long id) {
  return this.customerRepository.findById(id).map(customer -> {
   String metricName = metricPrefix("customers.read.found");
   this.counterService.increment(metricName); 2
   return ResponseEntity.ok(customer);
  }).orElseGet(() -> {
   String metricName = metricPrefix("customers.read.not-found");
   this.counterService.increment(metricName); 3
   return ResponseEntity.class.cast(ResponseEntity.notFound().build());
  });
 }

 @RequestMapping(method = RequestMethod.POST)
 ResponseEntity<?> add(@RequestBody Customer newCustomer) {
  this.customerRepository.save(newCustomer);
  ServletUriComponentsBuilder url = ServletUriComponentsBuilder
   .fromCurrentRequest();
  URI location = url.path("/{id}").buildAndExpand(newCustomer.getId()).toUri();
  return ResponseEntity.created(location).build();
 }

 @RequestMapping(method = RequestMethod.DELETE, value = "/{id}")
 ResponseEntity<?> delete(@PathVariable Long id) {
  this.customerRepository.delete(id);
  return ResponseEntity.noContent().build();
 }

 @RequestMapping(method = RequestMethod.GET)
 ResponseEntity<?> get() {
  return ResponseEntity.ok(this.customerRepository.findAll());
 }

 4
 protected String metricPrefix(String k) {
  return k;
 }

}
1

The CounterService is auto-configured for you. If you’re using Java 8, you’ll get a better-performing implementation than on an earlier Java version.

2

Record how many requests resulted in a successful match…

3

…and how many were a miss.

4

We’ll override this method in the next example to change the key used for the metric.
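The Java 8 performance note in callout 1 likely comes down to the counter’s backing data structure: java.util.concurrent.atomic.LongAdder (new in Java 8) stripes contended updates across internal cells, where a single AtomicLong is one hot memory location that every writer fights over. A self-contained illustration (the performance claim is the point; both produce identical counts):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// Both counters arrive at the same total. Under heavy concurrent writes,
// LongAdder is typically faster because increments land on different
// cells instead of a single contended compare-and-swap target.
public class CounterComparison {

    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        IntStream.range(0, 100_000).parallel().forEach(i -> {
            atomic.incrementAndGet(); // single hot memory location
            adder.increment();        // striped; cheap under contention
        });

        System.out.println(atomic.get()); // 100000
        System.out.println(adder.sum());  // 100000
    }
}
```

The trade-off is that LongAdder’s sum() is a snapshot, not a linearizable read, which is perfectly acceptable for metrics.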

The CounterService and GaugeService capture metrics as they happen, inline with the transaction being observed. To capture metrics this way, we need to update code to emit them whenever an event worth observing happens. It’s not always easy to insert instrumentation into the request path of our business components; sometimes it’s easier to capture metrics retroactively. The Spring Boot Actuator org.springframework.boot.actuate.endpoint.PublicMetrics interface supports centralizing metrics collection. Spring Boot provides implementations of this interface internally to surface information about the JVM environment, Apache Tomcat, the configured DataSource, etc. In Example 1-3, we will look at an example that captures information about customers in our application.

Example 1-3. A custom PublicMetrics implementation exposing how many customers exist in the system
package demo.metrics;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.endpoint.PublicMetrics;
import org.springframework.boot.actuate.metrics.Metric;
import org.springframework.stereotype.Component;

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

@Component
class CustomerPublicMetrics implements PublicMetrics {

 private final CustomerRepository customerRepository;

 @Autowired
 public CustomerPublicMetrics(CustomerRepository customerRepository) {
  this.customerRepository = customerRepository;
 }

 @Override
 public Collection<Metric<?>> metrics() {

  Set<Metric<?>> metrics = new HashSet<>();

  long count = this.customerRepository.count();

  1
  Metric<Number> customersCountMetric = new Metric<>("customers.count", count);
  metrics.add(customersCountMetric);
  return metrics;
 }
}
1

This metric reports the aggregate count of Customer records in the database.

So far we’ve looked at metrics as fixed-point-in-time quantities. They represent a value as of the moment you review it. These are useful, but they don’t have context. For some values, a fixed point-in-time value is pointless. Absent history—the perspective of time—it’s hard to know whether a value represents an improvement or a regression. Given the axis of time, we can take a value and derive statistics: averages, medians, means, percentiles, etc.

The Spring Boot Actuator seamlessly integrates with the Dropwizard Metrics library. Coda Hale developed the Dropwizard Metrics library while at Yammer to capture gauges, counters, and a handful of other types of metrics. Add io.dropwizard.metrics : metrics-core to your classpath to get started.

The Dropwizard Metrics library includes support for meters. A meter measures the rate of events over time (e.g., “orders per second”). If the Dropwizard Metrics library is on the classpath, and you prefix any metric captured through the CounterService or GaugeService with meter., the Spring Boot Actuator framework will delegate to the Dropwizard Metrics Meter implementation to calculate and persist this metric.

From the Dropwizard Metrics documentation: “Meters measure the rate of the events in a few different ways. The mean rate is the average rate of events. It’s generally useful for trivia, but as it represents the total rate for your application’s entire lifetime (e.g., the total number of requests handled, divided by the number of seconds the process has been running), it doesn’t offer a sense of recency. Luckily, meters also record three different exponentially-weighted moving average rates: the 1-, 5-, and 15-minute moving averages.”
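A minimal sketch of how such an exponentially weighted moving average works follows, modeled loosely on Dropwizard’s EWMA (which ticks every 5 seconds); the class and method names here are our own simplification, not Dropwizard’s API:

```java
// Simplified exponentially weighted moving average rate, in the style of
// a meter's one-minute rate: events accumulate between ticks, and each
// tick folds the latest interval's rate into the average with weight alpha.
public class Ewma {

    private final double alpha;           // weight given to the newest interval
    private final double intervalSeconds; // how often tick() is called
    private double rate;                  // current rate, events per second
    private long uncounted;               // events since the last tick
    private boolean initialized;

    // e.g., a one-minute EWMA ticked every 5 seconds: new Ewma(5.0, 60.0)
    public Ewma(double intervalSeconds, double windowSeconds) {
        this.intervalSeconds = intervalSeconds;
        this.alpha = 1.0 - Math.exp(-intervalSeconds / windowSeconds);
    }

    public void mark(long n) {
        uncounted += n;
    }

    public void tick() { // called once per interval
        double instantRate = uncounted / intervalSeconds;
        uncounted = 0;
        if (initialized) {
            rate += alpha * (instantRate - rate); // decay toward the new rate
        } else {
            rate = instantRate;                   // seed on the first tick
            initialized = true;
        }
    }

    public double rate() {
        return rate;
    }
}
```

Only rate and uncounted are retained between ticks, which is why a meter’s memory use stays constant no matter how long the process runs.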

The Meter doesn’t need to retain all values it records, as it uses an exponentially weighted moving average; it retains a set of samples of values, over time. This makes it memory-efficient even over great periods of time. Let’s revise our example to use a Meter and then review the effect on the recorded metrics. We’ll simply extend the CustomerRestController that we looked at in Example 1-2, overriding the metricPrefix method to prefix all CounterService metrics with meter., instead (see Examples 1-4 and 1-5).

Example 1-4. Collecting metered metrics with the CounterService and the Dropwizard Metrics library
package demo.metrics;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.metrics.CounterService;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/metered/customers")
public class MeterCustomerRestController extends CustomerRestController {

 @Autowired
 MeterCustomerRestController(CustomerRepository repository,
  CounterService counterService) {
  super(repository, counterService);
 }

 @Override
 protected String metricPrefix(String k) {
  return "meter." + k; 1
 }
}
1

Prefix all recorded metrics with meter.

Example 1-5. Metrics powered by the Dropwizard Meter
{
   "meter.customers.read.fifteenMinuteRate" : 0.102683423806518,
   "meter.customers.read.meanRate" : 0.00164167411117908,
   "meter.customers.read.fiveMinuteRate" : 0.0270670566473226,
   "meter.customers.read.oneMinuteRate" : 9.07998595249702e-06
   ...
}

A counter and gauge capture a single value. A meter captures the rate of events over a time period. A meter does not tell us anything about the frequencies of values in a data set. A histogram is a statistical distribution of the values in a stream of data: it shows how many times a certain value occurs. It lets you answer questions like “What percent of orders have more than one item in the cart?” The Dropwizard Metrics Histogram measures the minimum, maximum, mean, and median, as well as the 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles.

If you’ve taken a basic statistics class, you know that we’d need all of the data points in order to derive these values with perfect accuracy. This can be an overwhelming amount of data, even over a small period of time. Suppose your application saw 1,000 logical transactions a second, with 10 recorded values, or actions, per transaction. Over a day, that’s 864,000,000 values (24 × 60 × 60 × 1,000 × 10)! If we’re using Java, at eight bytes per long, that’s more than six gigabytes of data per day! Many applications won’t see this much traffic, but some will. Either way, it’s not hard to scale the numbers up or down and see that the volume will eventually become a problem.
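The back-of-the-envelope arithmetic is easy to check:

```java
public class DataVolume {

    public static void main(String[] args) {
        // 1,000 transactions/second x 10 values/transaction, over one day
        long valuesPerDay = 24L * 60 * 60 * 1000 * 10;
        System.out.println(valuesPerDay); // 864000000

        // Each value stored as a Java long costs eight bytes
        long bytesPerDay = valuesPerDay * 8;
        double gibibytesPerDay = bytesPerDay / (1024.0 * 1024 * 1024);
        System.out.println(gibibytesPerDay); // roughly 6.4
    }
}
```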

The Dropwizard Metrics library instead uses reservoir sampling to keep a statistically representative sample of measurements as they happen: as time progresses, older values in the reservoir are replaced by newer ones. The result isn’t perfect (it’s lossy), but it’s efficient. The Spring Boot Actuator framework will automatically convert any value submitted using the GaugeService with a metric key prefix of histogram. into a Dropwizard Histogram if the Dropwizard Metrics library is on the classpath.
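The essence of reservoir sampling is easy to see in a standalone sketch (Vitter's Algorithm R). Dropwizard's reservoirs are more sophisticated — exponentially decaying, so recent measurements are favored — but the core idea of keeping a fixed-size, representative sample of an unbounded stream is the same:

```java
import java.util.Arrays;
import java.util.Random;

public class ReservoirSketch {

    private final long[] reservoir; // fixed-size sample, regardless of stream length
    private long count;             // how many values we've seen in total
    private final Random random = new Random();

    ReservoirSketch(int size) {
        this.reservoir = new long[size];
    }

    void update(long value) {
        count++;
        if (count <= reservoir.length) {
            reservoir[(int) (count - 1)] = value; // fill phase
        }
        else {
            // Replace a random slot with probability size/count, which keeps
            // every value seen so far equally likely to be in the sample.
            long slot = (long) (random.nextDouble() * count);
            if (slot < reservoir.length) {
                reservoir[(int) slot] = value;
            }
        }
    }

    long[] snapshot() {
        return Arrays.copyOf(reservoir, (int) Math.min(count, reservoir.length));
    }

    public static void main(String[] args) {
        ReservoirSketch sketch = new ReservoirSketch(100);
        for (long i = 0; i < 1_000_000; i++) {
            sketch.update(i);
        }
        // A million values seen, but only a hundred retained.
        System.out.println(sketch.snapshot().length); // 100
    }
}
```

Percentiles, medians, and so on are then computed against the snapshot rather than the full stream, which is why the derived values are approximate.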

Let’s look at Example 1-6, which captures histograms for file uploads.

Example 1-6. Collecting a distribution of file upload sizes using the Dropwizard Metrics histogram implementation
package demo.metrics;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.metrics.GaugeService;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/histogram/uploads")
public class HistogramFileUploadRestController {

 private final GaugeService gaugeService;

 private Log log = LogFactory.getLog(getClass());

 @Autowired
 HistogramFileUploadRestController(GaugeService gaugeService) {
  this.gaugeService = gaugeService;
 }

 @RequestMapping(method = RequestMethod.POST)
 void upload(@RequestParam MultipartFile file) {
  long size = file.getSize();
  this.log.info(String.format("received %s with file size %s",
   file.getOriginalFilename(), size));
  this.gaugeService.submit("histogram.file-uploads.size", size); 1
 }
}
1

Prefix all recorded metrics with histogram. to have them converted into Dropwizard Histogram instances behind the scenes.

Example 1-6 maintains a histogram of file upload sizes. We used curl to upload three files of varying sizes, in random order (see Table 1-2).

Table 1-2. The sample files and their sizes
File Size   Filename                Frequency
8.0K        ${HOME}/Desktop/1.png   2
32K         ${HOME}/Desktop/2.png   5
40K         ${HOME}/Desktop/3.png   3

The /metrics endpoint confirms what the table tells us (Example 1-7).

Example 1-7. Metrics powered by the Dropwizard histogram
{
   ...
   "histogram.file-uploads.size.snapshot.98thPercentile" : 38803,
   "histogram.file-uploads.size.snapshot.999thPercentile" : 38803,
   "histogram.file-uploads.size.snapshot.median" : 29929,
   "histogram.file-uploads.size.snapshot.mean" : 27154.1998413605,
   "histogram.file-uploads.size.snapshot.75thPercentile" : 38803,
   "histogram.file-uploads.size.snapshot.min" : 6347,
   "histogram.file-uploads.size.snapshot.max" : 38803,
   "histogram.file-uploads.size.count" : 10,
   "histogram.file-uploads.size.snapshot.95thPercentile" : 38803,
   "histogram.file-uploads.size.snapshot.99thPercentile" : 38803,
   "histogram.file-uploads.size.snapshot.stdDev" : 11639.4103925448,
   ...
}

The Dropwizard Metrics library also supports timers. A timer measures the rate that a particular piece of code is called and the distribution of its duration. It answers the question: how long does it usually take for a given type of request to be run, and what are atypical durations? The Spring Boot Actuator framework will automatically convert any metric starting with timer. that’s submitted using the GaugeService into a Timer.
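Conceptually, a timer is a meter plus a histogram of durations: it counts invocations and records how long each one took. A toy model of that idea (an illustration of the concept, not Dropwizard's actual Timer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class TimerSketch {

    // Each recorded duration; a real timer would feed these into a
    // reservoir-backed histogram rather than an unbounded list.
    private final List<Long> durationsNanos = new ArrayList<>();

    // Run a unit of work, recording its duration on the way out.
    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        }
        finally {
            durationsNanos.add(System.nanoTime() - start);
        }
    }

    long count() {
        return durationsNanos.size();
    }

    double maxMillis() {
        return durationsNanos.stream()
            .mapToLong(Long::longValue).max().orElse(0L) / 1_000_000.0;
    }

    public static void main(String[] args) {
        TimerSketch timer = new TimerSketch();
        for (int i = 0; i < 3; i++) {
            timer.time(() -> "Hi, " + System.currentTimeMillis());
        }
        System.out.println(timer.count()); // 3
    }
}
```

From the count and the distribution of durations, a timer can derive both call rates (like a meter) and duration percentiles (like a histogram).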

You can time requests using a variety of mechanisms. Spring itself ships with the venerable StopWatch class, which is perfect for our purposes (Example 1-8).

Example 1-8. Capturing timings using the Spring StopWatch
package demo.metrics;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.metrics.GaugeService;
import org.springframework.http.ResponseEntity;
import org.springframework.util.StopWatch;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TimedRestController {

 private final GaugeService gaugeService;

 @Autowired
 public TimedRestController(GaugeService gaugeService) {
  this.gaugeService = gaugeService;
 }

 @RequestMapping(method = RequestMethod.GET, value = "/timer/hello")
 ResponseEntity<?> hello() throws Exception {
  StopWatch sw = new StopWatch(); 1
  sw.start();
  try {
   Thread.sleep((long) (Math.random() * 60) * 1000);
   return ResponseEntity.ok("Hi, " + System.currentTimeMillis());
  }
  finally {
   sw.stop();
   this.gaugeService.submit("timer.hello", sw.getLastTaskTimeMillis());
  }
 }

}
1

This is the Spring framework StopWatch, which we use here to time how long a request takes.

The timer gives us all that a histogram does and then some, specifically for durations (Example 1-9).

Example 1-9. The results of the Timer in the metrics output
{
   ...
   "timer.hello.snapshot.stdDev" : 11804,
   "counter.status.200.timer.hello" : 7,
   "timer.hello.snapshot.75thPercentile" : 35004,
   "timer.hello.meanRate" : 0.0561559793104086,
   "timer.hello.snapshot.mean" : 27639,
   "timer.hello.snapshot.min" : 2007,
   "timer.hello.snapshot.max" : 42003,
   "timer.hello.snapshot.median" : 35004,
   "timer.hello.snapshot.98thPercentile" : 42003,
   "timer.hello.fifteenMinuteRate" : 0.182311662062598,
   "timer.hello.snapshot.99thPercentile" : 42003,
   "timer.hello.snapshot.999thPercentile" : 42003,
   "timer.hello.oneMinuteRate" : 0.0741487724647533,
   "timer.hello.fiveMinuteRate" : 0.153231174431025,
   "timer.hello.count" : 7,
   "gauge.response.timer.hello" : 28008,
   "timer.hello.snapshot.95thPercentile" : 42003
...
}

The Dropwizard Metrics library enriches the Spring Boot metrics subsystem. It gives us access to a bevy of statistics that we wouldn’t have otherwise had.

Joined-up views of metrics

Thus far, our examples have all run on a single node: one instance, one host. As we scale out, it will become critical to centralize the metrics from across all services and instances. We can use a time series database (TSDB) to collect and store metrics centrally. A time series database is optimized for the collection, analysis, and sometimes visualization of metrics over time; it stores values for a given key over a period of time. There are many popular time series databases, like Ganglia, Graphite, OpenTSDB, InfluxDB, and Prometheus. Usually, they work in tandem with something that supports graphing the data in the time series database. There are many fine technologies for graphing time series data, the most popular of which seems to be Grafana. Alternative visualization technologies abound; many companies have open sourced their tools, such as Vimeo’s Graph Explorer, Ticketmaster’s Metrilyx, and Square’s Cubism.js.

Spring Boot supports writing metrics to a time series database using implementations of the MetricWriter interface. Out of the box, Spring Boot can publish metrics to a Redis instance, to JMX (possibly more useful for development), out over a Spring framework MessageChannel, or to any service that speaks the StatsD protocol. StatsD was originally a Node.js daemon, written at Etsy, that acted as a proxy for Graphite/Carbon; the protocol has become so popular that many clients and services now speak it, and StatsD itself supports multiple backends besides Graphite, including InfluxDB, OpenTSDB, and Ganglia. To take advantage of the various MetricWriter implementations, simply define a bean of the appropriate type and annotate it with @ExportMetricWriter. If StatsD doesn’t suit your use case, or you want more control, Dropwizard also supports publishing metrics to downstream systems through its *Reporter implementations.

Metric data dimensions

The data you create in a time series database is still data, even if the dimensions of that data would seem to be very limited: a metric has a key and a value, at least. Therein lies the rub; there’s very little in the way of schema. Different time series databases provide improvements and offer other dimensions of data. Some offer the notion of labels or tags. The only dimension common to all time series databases, however, is a key. Most implementations we’ve seen use hierarchical keys (e.g., a.b.c.). Many time series databases support glob queries (e.g., a.b.*), which will return all metrics that match the prefix a.b. Each path component in a key should have a clear, well-defined purpose, and volatile path components should be kept as deep into the hierarchy as possible. Design your keys and metrics to support what will certainly be a growing number of metrics. Think carefully about what you capture in your metrics, be it through hierarchical keys or other dimensions like labels and tags.

How will you encode requests across different products in the same time series database? Capture the component name, e.g., order-service.

How will you encode different types of activities or processes? HTTP requests, messaging-based back-office requests, back-office batch jobs, or something else? order-service.tasks.fulfillment.validate-shipping or order-service.requests.new-order?

How will you encode the information so that it can ultimately be correlated with product management-facing systems like HP’s Vertica or Facebook’s Scuba? While we’d like to claim that all operational telemetry directly translates into business metrics, it’s just not true. It can be useful, though, to have a way of capturing this information and connecting it. You can do this up front in the metrics’ keys themselves, or perhaps using labels and tags.

How will you correlate requests to A/B tests (or experiments), where a sample of the population runs through a code path for a given feature that behaves differently than the majority of requests? This helps gauge whether a feature works and is well received. The implication here is that you may have the same metric in two different code paths, one an experimental alternative to the other. Many organizations have a system for experiments, and surface experiment identifiers that should be incorporated into the metrics.
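These naming questions can be made concrete with a small sketch of hierarchical keys and glob-style prefix queries. The key layout here (component name first, volatile parts deepest) is illustrative, not a standard:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MetricKeys {

    // A minimal glob matcher: "a.b.*" matches every key under the
    // "a.b." prefix; anything else must match exactly.
    static boolean matches(String glob, String key) {
        if (glob.endsWith(".*")) {
            String prefix = glob.substring(0, glob.length() - 1); // keep the dot
            return key.startsWith(prefix);
        }
        return glob.equals(key);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "order-service.requests.new-order",
            "order-service.tasks.fulfillment.validate-shipping",
            "customer-service.requests.read");

        // All request metrics for the order-service component:
        List<String> hits = keys.stream()
            .filter(k -> matches("order-service.requests.*", k))
            .collect(Collectors.toList());
        System.out.println(hits); // [order-service.requests.new-order]
    }
}
```

Because the component name is the leftmost path element, one glob query scopes the result to a single product; the more volatile request names stay at the deepest level.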

As always, schema is a subjective matter, and newer time series databases differentiate themselves on the richness of the data collected and on scale. Some time series databases are lossy, while others scale horizontally, almost without limit. You should choose your time series database just as you would any other database, carefully considering the opportunities for schema design. On the one hand, it should be friction-free for developers to capture metrics; on the other hand, some forethought now can go a long way later.

Shipping metrics from a Spring Boot application

You can readily find hosted versions of many of these time series databases. One is Hosted Graphite, which is easy to integrate with. It preconfigures Graphite, Graphite Composer, and Grafana. Grafana and Graphite Composer both let you build graphs based on collected metrics. You can, of course, run your own Graphite instance, but keep in mind the goal is, as always, to get to production as quickly as possible, so we tend to prefer cloud-based services (Software as a Service, or SaaS); we don’t like to run software unless we can sell it. There are a few worthy options for connecting to Graphite available to the Spring Boot developer. You can use Spring Boot’s StatsdMetricWriter, which speaks the StatsD protocol and works with many of the aforementioned backends. You can also use one of the myriad native Dropwizard Metrics reporter implementations if, for example, you’d prefer to use a native protocol besides StatsD. We’ll do that here to communicate natively with Graphite, as demonstrated in Example 1-10.

Example 1-10. Configuring a Dropwizard Metrics GraphiteReporter instance
package demo.metrics;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import javax.annotation.PostConstruct;
import java.util.concurrent.TimeUnit;

@Configuration
class ActuatorConfiguration {

 ActuatorConfiguration() {
  java.security.Security.setProperty("networkaddress.cache.ttl", "60"); 1
 }

 @Bean
 GraphiteReporter graphiteWriter(
  @Value("${hostedGraphite.apiKey}") String apiKey,
  @Value("${hostedGraphite.url}") String host,
  @Value("${hostedGraphite.port}") int port, MetricRegistry registry) {

  GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
   .prefixedWith(apiKey) 2
   .build(new Graphite(host, port));
  reporter.start(1, TimeUnit.SECONDS);
  return reporter;
 }

}
1

Prevent DNS caching, because HostedGraphite.com nodes may move and be mapped to a new DNS route.

2

HostedGraphite.com’s service establishes authentication using an API key and expects that you transmit it as part of the prefix field. This is, admittedly, a bit of an ugly hack, but there is otherwise no obvious place to put the authentication information!

Drive traffic to the CustomerRestController or MeterCustomerRestController, and you’ll see the traffic reflected in graphs that you can create on HostedGraphite.com in either the Grafana interface or the Graphite composer interface, depicted in Figures 1-1 and 1-2.

Figure 1-1. The Graphite Composer dashboard
Figure 1-2. The Grafana dashboard

Identifying Your Service with the /info Endpoint

Ideally, you’re developing your code in a continuous delivery pipeline, where every commit could result in a push to production, potentially many times a day. If something goes wrong, the first thing people will want to know is which version of the code is running. Give your service the ability to identify itself using the /info endpoint.

By default, the /info endpoint is intentionally left blank, but it is a natural place to put information about the service itself. What’s the service name? Which Git commit triggered the build that ultimately resulted in the push to production? What is the service version?

You can contribute custom properties by prefixing properties with info. in the environment through the normal channels (application.properties, application.yml, etc.). You can also surface information about the state of the git source code repository when the project was built by adding in the pl.project13.maven : git-commit-id-plugin Maven plug-in. This plug-in is preconfigured for Maven users in the Spring Boot parent Maven build. It generates a file, git.properties, containing the git.branch and git.commit properties. The /info endpoint will know to look for it if it’s available (Example 1-11).

Example 1-11. The Git branch, commit ID, and commit time exposed from /info
{
  git: {
    branch: "master",
    commit: {
      id: "407359e",
      time: "2016-03-23T00:47:09+0100"
    }
  }
  ...
}

It’s very simple to add custom properties, as well. Any environment property prefixed with info. will be added to the output of this endpoint. Spring Boot’s default Maven plug-in configuration, for example, is already set up to handle Maven resource filtering. You can take advantage of Maven resource filtering to emit custom properties captured at build time, like the Maven project.artifactId and project.version (Example 1-12).

Example 1-12. Capturing custom build-time information like the project’s artifactId and version with the /info endpoint by contributing properties during the build with Maven resource filtering
info.project.version=@project.version@
info.project.artifactId=@project.artifactId@

Once that’s done, bring up your /info endpoint and identify what’s happening (Example 1-13).

Example 1-13. Capturing custom build-time information like the project’s artifactId and version with the /info endpoint
{
  ...
  project: {
    artifactId: "actuator",
    version: "1.0.0-SNAPSHOT"
  }
}

Health Checks

An application needs a way of volunteering its health to infrastructure. A good health check should provide an aggregate status that sums up the reported statuses for individual components in play. Health checks are often used by load balancers to determine the viability of a node. Load balancers may evict nodes based on the HTTP status code returned. The org.springframework.boot.actuate.endpoint.HealthEndpoint collects all org.springframework.boot.actuate.health.HealthIndicator implementations in the application context and exposes them. Example 1-14 shows the output of the default /health endpoint in our sample application.

Example 1-14. The output of the default /health endpoint for our sample application
{
  status: "UP",
  diskSpace: {
    status: "UP",
    total: 999334871040,
    free: 735556071424,
    threshold: 10485760
  },
  redis: {
    status: "UP",
    version: "3.0.7"
  },
  db: {
    status: "UP",
    database: "H2",
    hello: 1
  }
}

Spring Boot automatically registers common HealthIndicator implementations based on various auto-configurations for JavaMail, MongoDB, Cassandra, JDBC, Solr, Redis, Elasticsearch, the filesystem, and more. These health indicators cover the things your service depends on that may fail independently of your service. In the example above, we see that Redis, the filesystem, and our JDBC DataSource are all automatically accounted for.
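The top-level status in Example 1-14 is an aggregate: the worst individual status wins. A simplified model of that reduction (Spring Boot's actual aggregator is pluggable and configurable; this hypothetical sketch just hardcodes a severity order):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

public class StatusAggregation {

    // Most severe first: the aggregate takes the worst reported status.
    // This ordering is an assumption for illustration, not Spring Boot's API.
    static final List<String> SEVERITY =
        Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(Collection<String> componentStatuses) {
        return componentStatuses.stream()
            .filter(SEVERITY::contains)
            .min(Comparator.comparingInt(SEVERITY::indexOf)) // most severe wins
            .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList("UP", "UP", "UP")));   // UP
        System.out.println(aggregate(Arrays.asList("UP", "DOWN", "UP"))); // DOWN
    }
}
```

A single failing component (say, Redis reporting DOWN) therefore flips the whole /health response to DOWN, which is exactly what a load balancer probing the endpoint needs to see.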

Let’s contribute a custom HealthIndicator. The contract for a HealthIndicator is simple: when asked, return a Health instance with the appropriate status. Other components in the system need to be able to influence the returned Health object. You could directly inject the relevant HealthIndicator and manipulate its state in every component that might affect that state, but this couples a lot of application code to a secondary concern, the health status. An alternative approach is to use Spring’s ApplicationContext event bus to publish events within components and manipulate the HealthIndicator based on acknowledged events.

In the following code blocks, we’ll establish an emotional health indicator (Example 1-15) that is sad (DOWN) when it receives a SadEvent (Example 1-16) and happy (UP) when it receives a HappyEvent (Example 1-17).

Example 1-15. An emotional HealthIndicator
package demo.health;

import org.springframework.boot.actuate.health.AbstractHealthIndicator;
import org.springframework.boot.actuate.health.Health;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.Optional;

@Component
class EmotionalHealthIndicator extends AbstractHealthIndicator {

 private EmotionalEvent event;

 private Date when;

 1
 @EventListener
 public void onHealthEvent(EmotionalEvent event) {
  this.event = event;
  this.when = new Date();
 }

 2
 @Override
 protected void doHealthCheck(Health.Builder builder) throws Exception {
  Optional.ofNullable(this.event).ifPresent(evt -> {
   Class<? extends EmotionalEvent> eventClass = evt.getClass();
   Health.Builder healthBuilder = evt instanceof SadEvent
    ? builder.down() : builder.up();
   healthBuilder.withDetail("class", eventClass)
    .withDetail("when", this.when.toInstant().toString());
  });
 }

}
1

We’ll connect this listener method to ApplicationContext events using the @EventListener annotation.

2

The doHealthCheck method uses the Health.Builder to toggle the state of the health indicator based on the last known, recorded EmotionalEvent.

Example 1-16. The SadEvent
package demo.health;

public class SadEvent extends EmotionalEvent {
}
Example 1-17. The HappyEvent
package demo.health;

public class HappyEvent extends EmotionalEvent {
}

Now, any component in the ApplicationContext need only publish an appropriate event to trigger a corresponding status change (Example 1-18).

Example 1-18. The emotional REST endpoint
package demo.health;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
class EmotionalRestController {

 private final ApplicationEventPublisher publisher; 1

 @Autowired
 EmotionalRestController(ApplicationEventPublisher publisher) {
  this.publisher = publisher;
 }

 @RequestMapping("/event/happy")
 void eventHappy() {
  this.publisher.publishEvent(new HappyEvent()); 2
 }

 @RequestMapping("/event/sad")
 void eventSad() {
  this.publisher.publishEvent(new SadEvent());
 }
}
1

Spring automatically exposes an implementation of the ApplicationEventPublisher interface for use in component code. Indeed, the ApplicationContext that Spring uses to run your application is itself an ApplicationEventPublisher.

2

From there it’s trivial to dispatch an event between components.

Events make it easier to surface operational information without coupling the request path of business logic to secondary concerns like a health endpoint.

Audit Events

Events are a great way to capture almost anything in a system. Spring Boot supports auditing through audit events, which tie events in the application to the authenticated users that triggered them. Let’s look at a REST endpoint that also has Spring Security (org.springframework.boot : spring-boot-starter-security) on the classpath. In order to get a trivial demo working, we’ve configured a custom UserDetailsService implementation (Example 1-19).

Example 1-19. A few hardcoded users
package com.example;

import org.springframework.security.core.authority.AuthorityUtils;
import org.springframework.security.core.userdetails.User;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.core.userdetails.UsernameNotFoundException;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

@Service
class SimpleUserDetailsService implements UserDetailsService {

 private final Set<String> users = new ConcurrentSkipListSet<>();

 SimpleUserDetailsService() {
  1
  this.users.addAll(Arrays.asList("pwebb", "dsyer", "mbhave", "snicoll",
   "awilkinson"));
 }

 @Override
 public UserDetails loadUserByUsername(String s)
  throws UsernameNotFoundException {
  2
  return Optional
   .ofNullable(this.users.contains(s) ? s : null)
   .map(x -> new User(x, "pw", AuthorityUtils.createAuthorityList("ROLE_USER")))
   .orElseThrow(() -> new UsernameNotFoundException("couldn't find " + s + "!"));
 }
}
1

We hardcode a list of users (dsyer, pwebb, etc.) and…

2

…passwords (pw, for every user—don’t try this at home!) with a fixed role (ROLE_USER).

The Spring Boot auto-configuration locks down the HTTP endpoints using HTTP BASIC authentication, by default. Spring Security will generate events related to authentication and authorization: whether someone has authenticated (or tried, unsuccessfully), whether someone has signed out, etc. You can also create your own audit events. Let’s look at a trivial HTTP endpoint example that relies upon the fact that Spring Security will make available the currently authenticated java.security.Principal for injection into Spring MVC handler methods (Example 1-20).

Example 1-20. A trivial (but secure) HTTP endpoint
package com.example;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.audit.AuditEvent;
import org.springframework.boot.actuate.audit.listener.AuditApplicationEvent;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.security.Principal;
import java.util.Collections;

@RestController
class GreetingsRestController {

 public static final String GREETING_EVENT = "greeting_event".toUpperCase();

 private final ApplicationEventPublisher appEventPublisher;

 @Autowired
 GreetingsRestController(ApplicationEventPublisher appEventPublisher) {
  this.appEventPublisher = appEventPublisher;
 }

 @GetMapping("/hi")
 String greet(Principal p) { 1
  String msg = "hello, " + p.getName() + "!";

  AuditEvent auditEvent = new AuditEvent(p.getName(), 2
   GREETING_EVENT, 3
   Collections.singletonMap("greeting", msg)); 4

  this.appEventPublisher.publishEvent( 5
   new AuditApplicationEvent(auditEvent));

  return msg;
 }

}
1

Inject the currently authenticated Principal and…

2

…use it to create an AuditEvent, dereferencing the authenticated Principal name…

3

…along with an event name (it’s arbitrary, use something meaningful to your system) and…

4

…any extra metadata you’d like included in the log.

5

Finally, use the Spring application context event mechanism to dispatch the Audit​Event with a wrapper, AuditApplicationEvent.

You can call the secure endpoint, authenticating using HTTP BASIC. We’re using the friendly httpie client, but you’re free to use whatever you want (Example 1-21).

Example 1-21. The output of the /auditevents Actuator HTTP endpoint
{
   "events" : [
      {
         "timestamp" : "2017-04-26T14:01:10+0000",
         "principal" : "dsyer",
         "type" : "AUTHENTICATION_SUCCESS",
         "data" : {
            "details" : {
               "sessionId" : null,
               "remoteAddress" : "127.0.0.1"
            }
         }
      },
      {
         "timestamp" : "2017-04-26T14:01:10+0000",
         "data" : {
            "greeting" : "hello, dsyer!"
         },
         "type" : "GREETING_EVENT",
         "principal" : "dsyer"
      }
   ]
}

The audit events mechanism makes it trivial to capture information about users in the system. Spring Boot has a component, AuditListener, that listens for audit events and records them using an AuditEventRepository implementation. By default, this implementation is in-memory, though it would be trivial to implement it using some sort of persistent backing store. Contribute your own implementation and Spring Boot will honor that instead.

You can also listen (and react) to audit events in your own code using the same event listener mechanisms as you would for any other Spring events (Example 1-22).

Example 1-22. A simple AuditApplicationEvent listener
package com.example;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.boot.actuate.audit.listener.AuditApplicationEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
class SimpleAuditEventListener {

 private Log log = LogFactory.getLog(getClass());

 @EventListener(AuditApplicationEvent.class)
 public void onAuditEvent(AuditApplicationEvent event) {
  this.log.info("audit-event: " + event.toString());
 }
}

Application Logging

Here we are, in the future! We have self-driving cars and houses that talk to us. We can spin up a thousand servers in the blink of an eye! And yet the venerable logfile remains one of the best ways we have to understand a node’s or a system’s behavior. Logs reflect a process’s rhythm, its activity. Logging requires that intrusive statements be added to the code, but the resulting logfiles are among the most decoupled tools out there: entire ecosystems, tools, languages, and big-data platforms have developed around usefully mining data from logs.

You make two decisions when logging:

Log output

Where do you want the log output to appear? In a file? On the console? In a SyslogD service?

Log levels

What granularity of output do you want? Do you want every little hiccup to be printed out, or just the things that may threaten the world?

Specifying Log Output

Logs appear by default on the console in a Spring Boot application. You can optionally configure writing out to a file or some other log appender, but the console is a particularly sensible default.

If you’re doing development, you’ll want to see the logs as they arrive, and if you’re running your application in a cloud environment, then you shouldn’t need to worry about where the logs get routed to. This is one of the tenets of the twelve-factor manifesto:

A twelve-factor app never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app’s behavior.

Log collectors or log multiplexers, like Cloud Foundry’s Loggregator or Logstash, take the resulting logs from disparate processes and unify them into a single stream, possibly forwarding that stream onward to someplace where it may be analyzed. Log data should be as structured as possible—use consistent delimiters, define groups, and support something like a log schema—to support analysis. Logs should be treated as event streams; they tell a story about the behavior of the system. Log information might be the output of one process and the input to another downstream analytical process. One very popular log multiplexer, Logstash, provides numerous plug-ins that let you pipeline logs from multiple input sources and connect those logs to a central analytics system like Elasticsearch, a full-text search engine powered by Lucene.

Another log multiplexer, Loggregator, will aggregate and forward logs to your console, using cf logs $YOUR_APP_NAME, or to any SyslogD protocol-compliant service, including on-premise or hosted services like Elasticsearch (via Logstash), Papertrail, Splunk, Splunk Storm, SumoLogic, or, of course, SyslogD itself. Configure a log drain as you would any user-provided service, specifying -l to signal that it’s to be a log drain, as shown in Example 1-23.

Example 1-23. Create a user-provided service
cf cups my-logs -l syslog://logs.papertrailapp.com:PORT

Then, it’s just a service that’s available for any application to bind to, as shown in Example 1-24.

Example 1-24. Binding the service to an application
cf bind-service my-app my-logs && cf restart my-app

Loggregator also publishes log messages over the WebSocket protocol, so it’s very simple to programmatically listen to the logs coming off any Cloud Foundry application. We’re using Java, and so benefit from the Cloud Foundry Java client’s easy integration with this WebSocket feed.

Pivotal Cloud Foundry also offers correlated logging (as depicted in Figure 1-3)—it’ll show you metrics about requests on a timeline, and then show you the logs for interesting periods of time on the timelines.

correlated logging on Pivotal's AppsManager
Figure 1-3. Correlated logging on Pivotal’s AppsManager

Specifying Log Levels

Logs are such a natural extension of an application’s state that it can be dizzying to even begin to choose among logging technologies. If you’re using Spring Boot, you’re probably fine just using the defaults. Spring Boot uses Commons Logging for all internal logging, but leaves the underlying log implementation open. Default configurations are provided for the JDK’s logging support, Log4J, Log4J2, and Logback. In each case, loggers are preconfigured to use console output, with optional file output also available.

By default, Spring Boot will use Logback, which in turn can capture and forward logs produced from other logging technologies like Apache Commons Logging, Log4j, Log4J2, etc. What does this mean? It means that all the likely log-producing dependencies on the classpath will work just fine out of the box, and their output will arrive on the console in a well-known, configured format that includes the date and time, log level, process ID, thread name, logger name, and the actual log message. It’ll even include color codes if you’re viewing the logs on a console!

You can use Spring Boot to manage log levels generically by specifying the levels in the Spring environment (as a property in application.properties or your Spring Cloud Config Server instance). Spring understands, and will appropriately map to the underlying logging provider, the following log levels: ERROR, WARN, INFO, DEBUG, or TRACE. You can also specify OFF to mute all output. Log levels are ordered by priority; it’s less important that somebody see statements in DEBUG intended to aid development than it is that they see potentially stability-threatening messages logged to ERROR. If you set a log level to ERROR, then nothing from that package except messages logged to ERROR will be visible. If you set the log level to TRACE, then all log messages from that package will be visible. A configured level shows all output from that level and every more severe level. Let’s look at an example.
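
The priority ordering can be illustrated with a toy example. The Level enum and isEnabled method here are stand-ins, not Spring Boot’s or Logback’s actual implementation:

```java
// A toy illustration (assumed names, not a real logging framework) of
// log-level priority filtering: a configured level admits messages at
// that level and at every more severe level.
public class LevelDemo {

    // Ordered from most to least severe, mirroring ERROR..TRACE in the text.
    enum Level { ERROR, WARN, INFO, DEBUG, TRACE }

    static boolean isEnabled(Level configured, Level message) {
        // Lower ordinal = more severe; a message passes if it is at least
        // as severe as the configured threshold.
        return message.ordinal() <= configured.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(isEnabled(Level.ERROR, Level.WARN));  // false: WARN is hidden at ERROR
        System.out.println(isEnabled(Level.TRACE, Level.DEBUG)); // true: TRACE shows everything
    }
}
```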

Log levels are hierarchical: if you set package a to WARN, then package a.b will also be set to WARN. There is a distinction between the configured log levels and the effective log levels. Example 1-25 shows the configuration to change the log level to ERROR for all code in the demo package.

Example 1-25. Specifying an arbitrary log level in the Spring environment
logging.level.demo=error

If you run this application, you’ll see that even though we have emitted the same message multiple times at different log levels, only one message appears on the console. Try it out by making an HTTP GET to http://localhost:8080/log, then change the log level to TRACE and restart the application. You’ll see the same message logged multiple times, in different colors if your terminal supports it. Example 1-26 shows the application.

Example 1-26. A Java application that logs three messages at different log levels
package demo;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.Optional;

@SpringBootApplication
@RestController
public class LoggingApplication {

 private Log log = LogFactory.getLog(getClass());

 public static void main(String[] args) {
  SpringApplication.run(LoggingApplication.class, args);
 }

 LoggingApplication() {
  triggerLog(Optional.empty());
 }

 @GetMapping("/log")
 public void triggerLog(@RequestParam Optional<String> name) {
  String greeting = "Hello, " + name.orElse("World") + "!";
  this.log.warn("WARN: " + greeting); 1
  this.log.info("INFO: " + greeting);
  this.log.debug("DEBUG: " + greeting);
  this.log.error("ERROR: " + greeting);
 }
}
1

Depending on the log level you specify, none, some, or all of these log messages will appear.

Thus far we’ve restarted the process to see the log levels updated, but we can also interrogate and dynamically, while the process is running, reconfigure log levels using the Spring Boot Actuator /loggers endpoint. If we use an HTTP GET, the endpoint shows us all the configured log levels in our application (Example 1-27).

Example 1-27. Enumerate all the log levels
{
   "loggers" : {
      ... 1
      "org.springframework.boot.actuate.endpoint" : {
         "effectiveLevel" : "INFO",
         "configuredLevel" : null
      },
      "demo" : {
         "effectiveLevel" : "ERROR",
         "configuredLevel" : "ERROR"
      }
   },
   "levels" : [
      "OFF",
      "ERROR",
      "WARN",
      "INFO",
      "DEBUG",
      "TRACE"
   ]
}
1

This excerpt shows only a handful of lines; the full response enumerates thousands more lines of configuration!

You can get details for a specific logger using /loggers/{logger}, where {logger} is the name of your package or log hierarchy name. In our example, we could call /loggers/demo to confirm the configuration for this particular level. You can call /loggers/ROOT to find the root log level that informs all otherwise unspecified and more specific log levels.

You can also update configured log levels using an HTTP POST to the relevant loggers endpoint.

Example 1-28 updates the configured level for the demo package.

Example 1-28. Update a log level
curl -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "TRACE"}' \
  http://localhost:8080/loggers/demo

This is useful enough, but it only applies to a single instance. If you’re running on the cloud and have a few instances running at the same time, it’s more useful to ratchet up or down log levels for the deployed application. If you are using Pivotal Web Services or Pivotal Cloud Foundry, this is simple. In Figures 1-4 and 1-5, we’ll peruse an application’s logs with the Pivotal AppsManager dashboard, and then we’ll reconfigure the log levels for a Spring Boot application.

Perusing an application's logs from the AppsManager dashboard
Figure 1-4. Perusing an application’s logs from the AppsManager dashboard
Reconfiguring the log levels for a Spring Boot Application on PCF or PWS
Figure 1-5. Reconfiguring the log levels for a Spring Boot Application on Pivotal Cloud Foundry or Pivotal Web Services

Distributed Tracing

There are many options to support understanding your application and its performance profile. Agent-based instrumentation technologies like New Relic and AppDynamics (both of which integrate seamlessly with Pivotal Cloud Foundry) use Java agents and automatic instrumentation to give you a low-level perspective on an application’s performance behavior. These tools are worth investigating, as they can give you runtime visibility into an application’s performance. APM tools can give you a cross-language, cross-technology dashboard of an application’s end-to-end behavior, from the HTTP request down to low-level data source access.

Advances in technology and cloud computing have made it easier to stand up and deploy services with ease. Cloud computing enables us to automate away the pain (from days or weeks—gasp!—to minutes) associated with standing up new services. This increase in velocity in turn enables us to be more agile, to think about smaller batches of independently deployable services. The proliferation of new services complicates reasoning about systemwide and request-specific performance characteristics.

When all of an application’s functionality lives in a monolith—what we call applications written as one, large, unbroken deployable like a .war or .ear—it’s much easier to reason about where things have gone wrong. Is there a memory leak? It’s in the monolith. Is a component not handling requests correctly? It’s in the monolith. Messages getting dropped? Also probably in the monolith. Distribution changes everything.

Systems behave differently under load and at scale. The specification of a system’s behavior often diverges from the actual behavior of the system, and the actual behavior may itself vary in different contexts. It is important to contextualize requests as they transit through a system. It’s also important to be able to talk about the nature of a specific request and to be able to understand that specific request’s behavior relative to the general behavior of similar requests in the past minute, hour, day, or whatever other useful interval provides a statistically significant sampling. Context helps us establish whether a request was abnormal and whether it merits attention. You can’t trace bugs in a system until you’ve established a baseline for what normal is. How long is long? For some systems it might be microseconds; for others it might be seconds or minutes.

In this section, we’ll look at how Spring Cloud Sleuth, which supports distributed tracing, can help us establish this context and help us better understand a system’s actual behavior (its emergent behavior), not just its specified behavior.

Finding Clues with Spring Cloud Sleuth

Tracing is simple, in theory. As a request flows from one component in a system to another, through ingress and egress points, tracers add logic—instrumentation—where possible to perpetuate a unique trace ID that’s generated when the first request is made. As a request arrives at a component along its journey, a new span ID is assigned for that component and added to the trace. A trace represents the whole journey of a request, and a span is each individual hop, or request, along the way. Spans may contain tags, or metadata, that can be used to later contextualize the request and perhaps correlate a request to a specific transaction. Spans typically contain common tags like start timestamps and stop timestamps, though it’s easy to associate semantically relevant tags like a business entity ID with a span.

Let’s suppose we had two services, service-a and service-b. If an HTTP request arrived at service-a, and it in turn sent a message through Apache Kafka to service-b, then we would have one trace ID but two spans. Each span would have request-specific tags. The first span might have details of the HTTP request. The second span might have details of the message sent to the Apache Kafka broker.
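
That model can be sketched in plain Java. The Span class here is a hand-rolled stand-in for illustration, not Sleuth’s actual type:

```java
// A minimal, hand-rolled sketch (assumed class names, not Sleuth's API) of
// the trace/span model: one trace ID for the whole journey, a new span ID
// per hop, with tags attached to each span.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class TraceDemo {

    static class Span {
        final String traceId;                      // shared by every hop in the journey
        final String spanId = newId();             // unique to this hop
        final Map<String, String> tags = new HashMap<>();
        Span(String traceId) { this.traceId = traceId; }
    }

    static String newId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    public static void main(String[] args) {
        String traceId = newId();                  // generated when the first request arrives

        Span httpSpan = new Span(traceId);         // hop 1: HTTP request into service-a
        httpSpan.tags.put("http.method", "GET");

        Span kafkaSpan = new Span(traceId);        // hop 2: message from service-a to service-b
        kafkaSpan.tags.put("messaging.system", "kafka");

        List<Span> trace = List.of(httpSpan, kafkaSpan);
        System.out.println("spans in trace: " + trace.size());
        System.out.println("same trace id: " + httpSpan.traceId.equals(kafkaSpan.traceId));
    }
}
```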

Spring Cloud Sleuth (org.springframework.cloud:spring-cloud-starter-sleuth) automatically instruments common communication channels:

  • Requests over messaging technologies like Apache Kafka or RabbitMQ (or any other messaging system for which there is a Spring Cloud Stream binder)

  • HTTP headers received at Spring MVC controllers

  • Requests that pass through a Netflix Zuul microproxy

  • Requests made with the RestTemplate, etc.

  • Requests made through the Netflix Feign REST client

  • …and indeed most other types of requests and replies that a typical Spring-ecosystem application might encounter

Spring Cloud Sleuth sets up useful log formatting for you that logs the trace ID and the span ID. Assuming you’re running Spring Cloud Sleuth-enabled code in a microservice whose spring.application.name is my-service-id, you will see something like Example 1-29 in the logs for your microservice.

Example 1-29. Logs coming off a Spring Cloud Sleuth-instrumented application
2016-02-11 17:12:45.404 INFO [my-service-id,73b62c0f90d11e06,73b6etydf90d11e06,false]
  85184 --- [nio-8080-exec-1] com.example.MySimpleComponentMakingARequest     : ...

In that example, my-service-id is the spring.application.name, 73b62c0f90d11e06 is the trace ID, and 73b6etydf90d11e06 is the span ID. This information is very useful, and you can use whatever log analytics tools you have at your disposal to mine it; you can see the flow of a request through different services if you have all the logs, and the trace information, in a single place available for query and analysis.
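
As a sketch of the kind of mining such tools do, here is a hypothetical helper (not part of Sleuth) that extracts the bracketed [service,traceId,spanId,exportable] segment from a log line:

```java
// A small sketch (hypothetical helper, not part of Sleuth) that extracts the
// [service,traceId,spanId,exportable] segment Sleuth prepends to log lines,
// which is what a log-analytics query would key on.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SleuthLogParser {

    private static final Pattern PREFIX =
            Pattern.compile("\\[([^,\\]]+),([^,\\]]+),([^,\\]]+),([^,\\]]+)\\]");

    // Returns {service, traceId, spanId, exportable}, or null if absent.
    static String[] parse(String logLine) {
        Matcher m = PREFIX.matcher(logLine);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String line = "2016-02-11 17:12:45.404 INFO "
                + "[my-service-id,73b62c0f90d11e06,73b6etydf90d11e06,false] ...";
        String[] parts = parse(line);
        System.out.println("trace=" + parts[1] + " span=" + parts[2]);
    }
}
```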

Spring Cloud Sleuth instrumentation usually consists of two components: an object that does the tracing of some subsystem, and the specific SpanInjector<T> instance for that subsystem. The tracer is usually some sort of interceptor, listener, filter, etc., that you can insert into the request flow for the component under trace. You can create and contribute your own tracing if for some reason the component you need isn’t already accounted for out of the box.

How Much Data Is Enough?

Which requests should be traced? Ideally, you’ll want enough data to see trends reflective of live, operational traffic. You don’t want to overwhelm your logging and analysis infrastructure, though. Some organizations may keep only one request in every ten, or every thousand, or every million! By default, the threshold is 10%, or .1, though you may override it by configuring a sampling percentage (Example 1-30).

Example 1-30. Changing the sampling threshold percentage
spring.sleuth.sampler.percentage = 0.2

Alternatively, you may register your own Sampler bean definition and make the decision about which requests should be sampled. You can make more intelligent choices about which things to trace, for example, by ignoring successful requests, perhaps checking whether some component is in an error state, or really anything else. Example 1-31 shows the Sampler definition.

Example 1-31. The Spring Cloud Sampler interface
package org.springframework.cloud.sleuth;

import org.springframework.cloud.sleuth.Span;

public interface Sampler {
    boolean isSampled(Span s);
}

The Span given as an argument represents the span for the current in-flight request in the larger trace. You can do interesting and request-specific types of sampling if you’d like. You might decide to only sample requests that have a 500 HTTP status code, for example.
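
As an illustration of that last idea, here is a sketch of a status-driven sampling decision. The Span and Sampler types here are stand-ins that only mirror the shape of the real interface, and the "http.status_code" tag name is an assumption for the example:

```java
// A sketch of the kind of decision a custom Sampler might make, using
// stand-in types (the real ones live in org.springframework.cloud.sleuth):
// sample only requests whose status tag marks a server error.
import java.util.HashMap;
import java.util.Map;

public class ErrorSamplerDemo {

    // Stand-in for Sleuth's Span, just enough to carry tags.
    static class Span {
        final Map<String, String> tags = new HashMap<>();
    }

    // Mirrors the shape of the Sampler interface shown above.
    interface Sampler {
        boolean isSampled(Span s);
    }

    // Keep only spans representing 5xx responses.
    static final Sampler ERRORS_ONLY = span -> {
        String status = span.tags.getOrDefault("http.status_code", "200");
        return status.startsWith("5");
    };

    public static void main(String[] args) {
        Span ok = new Span();
        ok.tags.put("http.status_code", "200");

        Span failed = new Span();
        failed.tags.put("http.status_code", "500");

        System.out.println(ERRORS_ONLY.isSampled(ok));     // false
        System.out.println(ERRORS_ONLY.isSampled(failed)); // true
    }
}
```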

Make sure to set realistic expectations for your application and infrastructure. It may well be that the usage patterns for your applications require something more sensitive or less sensitive to detect trends and patterns. This is meant to be online telemetry; most organizations don’t warehouse this data more than a few days or, at the upper bound, a week.

OpenZipkin: A Picture Is Worth a Thousand Traces

Data collection is a start, but the goal is to understand the data, not just collect it. In order to appreciate the big picture, we need to get beyond individual events. We’ll use the OpenZipkin project. OpenZipkin is the open source version of Zipkin (Figure 1-6), a project that originated at Twitter in 2010 and is based on Google’s Dapper paper.

OpenZipkin Logo
Figure 1-6. OpenZipkin is the open source version of Zipkin
Note

Previously, the open source version of Zipkin evolved at a different pace than the version used internally at Twitter. OpenZipkin represents the synchronization of those efforts: OpenZipkin is Zipkin, and when we refer to Zipkin in this lesson, we’re referring to the version reflected in OpenZipkin.

Zipkin provides a REST API that clients talk to directly. This REST API is written with Spring MVC and Spring Boot, and standing up your own instance is as simple as using Zipkin’s @EnableZipkinServer annotation directly. The Zipkin server delegates writes to the persistence tier via a SpanStore. Presently, there is out-of-the-box support for an in-memory SpanStore and for one backed by MySQL.

As an alternative to talking to the Zipkin REST API directly, we can also publish messages to the Zipkin server over a Spring Cloud Stream binder like RabbitMQ or Apache Kafka, which is what you see in Example 1-32. Create a new Spring Boot application, add org.springframework.cloud : spring-cloud-sleuth-zipkin-stream to the classpath, and then add @EnableZipkinStreamServer to the application to accept and adapt incoming Spring Cloud Stream-based Sleuth Span instances into Zipkin’s Span type. It will then persist them using a configured SpanStore. You may use whatever Spring Cloud Stream binding you like, but in this case we’ll use Spring Cloud Stream RabbitMQ (org.springframework.cloud : spring-cloud-starter-stream-rabbit).

Example 1-32. The Zipkin server code
package demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.sleuth.zipkin.stream.EnableZipkinStreamServer;

1
@EnableZipkinStreamServer
@SpringBootApplication
public class ZipkinApplication {

 public static void main(String[] args) {
  SpringApplication.run(ZipkinApplication.class, args);
 }
}
1

This tells the Zipkin server to listen for incoming spans.

Add the Zipkin UI (io.zipkin : zipkin-ui) to the classpath of the Zipkin Stream server to visualize requests. Bring up the UI (also at http://localhost:9411, where the stream server lives) and you’ll find all the recent traces, if there are any. If there aren’t, let’s create some.

With the server up and running, we can stand up a couple of clients and make some requests. Let’s look at two trivial services, imaginatively named zipkin-client-a and zipkin-client-b. Both services have the required binder (org.springframework.cloud : spring-cloud-starter-stream-rabbit) and the Spring Cloud Sleuth Stream client (org.springframework.cloud : spring-cloud-sleuth-stream) on the classpath.

The client, zipkin-client-a, is configured to run on port 8082. There is a property, message-service, to tell the client where to find its service. You could as easily use service registration and discovery here, though. The client makes a request of the downstream service using a RestTemplate bean defined in the main class. It’s important that the client—in this case the RestTemplate—be a Spring bean. The Spring Cloud Sleuth configuration needs to know where to find the bean if it’s to be able to configure a Sleuth-aware interceptor for all requests that flow through it (Example 1-33).

Example 1-33. The message client
package demo;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.HttpMethod;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.util.Map;

@RestController
class MessageClientRestController {

 @Autowired
 private RestTemplate restTemplate;

 @Value("${message-service}")
 private String host;

 @RequestMapping("/")
 Map<String, String> message() {

  //@formatter:off
  ParameterizedTypeReference<Map<String, String>> ptr =
          new ParameterizedTypeReference<Map<String, String>>() { };
  //@formatter:on

  return this.restTemplate.exchange(this.host, HttpMethod.GET, null, ptr)
   .getBody();
 }
}

The service, zipkin-client-b, is configured to run on port 8081. It takes all the trace headers from the inbound request and includes them in its replies, along with a message (Example 1-34).

Example 1-34. The message service
package demo;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.servlet.http.HttpServletRequest;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@RestController
class MessageServiceRestController {

 @RequestMapping("/")
 Map<String, String> message(HttpServletRequest httpRequest) {

  List<String> traceHeaders = Collections.list(httpRequest.getHeaderNames())
   .stream().filter(h -> h.toLowerCase().startsWith("x-"))
   .collect(Collectors.toList()); 1

  Map<String, String> response = new HashMap<>();
  response.put("message", "Hi, @ " + System.currentTimeMillis());
  traceHeaders.forEach(h -> response.put(h, httpRequest.getHeader(h)));
  return response;
 }
}
1

Collect all headers contributed by Spring Cloud Sleuth (those starting with x-) from the outgoing request from zipkin-client-a and include them in the generated JSON response, along with a unique message.

Make a few requests at http://localhost:8082. You’ll get replies similar to what we see in Example 1-35.

Example 1-35. A sample reply coming from the traced request
{
   "x-b3-parentspanid" : "9aa83c71878b6cd4",
   "x-b3-sampled" : "1",
   "message" : "Hi, 1493358280026",
   "x-b3-traceid" : "9aa83c71878b6cd4",
   "x-span-name" : "http:",
   "x-b3-spanid" : "668b8e088a35f1db"
}

Now you can inspect the requests in the Zipkin server, at http://localhost:9411. You can sort by most recent, longest, etc., for finer-grained control over which results you see. In Figure 1-7, we see the results when searching for traces in the Zipkin server.

the results of doing a search for traces on the Zipkin main page
Figure 1-7. The results of a search for traces on the Zipkin main page

You can inspect the trace’s details, as shown in Figure 1-8.

details page showing the individual spans for a single trace
Figure 1-8. Details page showing the individual spans for a single trace

Each individual span also carries with it information (tags) about the particular request it’s associated with. You can view this detail by clicking on an individual span, as shown in Figure 1-9.

details panel showing relevant information and associated tags for a given span
Figure 1-9. Details panel showing relevant information and associated tags for a given span

Zipkin is in an enviable position: it knows how services interact with each other. It knows the topology of your system. It’ll even generate a handy visualization of that topology if you click the Dependencies tab, as depicted in Figure 1-10.

visualization of the topology of your services
Figure 1-10. Visualization of the topology of your services

Each element in the visualization can give you further information still, including which components use it and how many (traced) calls have been made. You can see this in Figure 1-11.

details for each service in the dependencies visualization
Figure 1-11. Details for each service in the dependencies visualization

If you move your application into a cloud platform, like Cloud Foundry, the routing infrastructure should be smart enough to also originate or perpetuate trace headers. Cloud Foundry does: as requests enter the system at the cloud router, headers are added or perpetuated to your running application (like your Spring application).

Tracing Other Platforms and Technologies

For Spring-based workloads, distributed tracing couldn’t be easier. However, tracing, by its very nature, is a cross-cutting concern for all services, no matter which technology stack they’re implemented in. The OpenTracing initiative is an effort to standardize the vocabulary and concepts of modern tracing for multiple languages and platforms. The OpenTracing API has support from multiple very large organizations, and its lead is one of the original authors of the Google Dapper paper. The effort defines language bindings; there are already implementations for JavaScript, Python, Go, etc. The Spring team will keep Spring Cloud Sleuth conceptually compatible with this effort and will track it. It is expected, though not required, that the bindings will as often as not have Zipkin as their backend.

Warning

The OpenTracing effort is relatively nascent, so you may find that it’s easier to just use an OpenZipkin client binding for another language instead of an OpenTracing-based implementation.

Dashboards

Thus far we’ve mostly looked at ways to surface information per node and how to customize that information. But that’s only useful insofar as we can connect it to a bigger picture about the larger system. The Actuator, for example, publishes information about any given node, but assumes there’s some sort of infrastructure to soak up this information and consolidate it, similar to the way Google manages services with their Borg Monitoring (“Borgmon”) approach. Borgmon is a centralized management and monitoring solution used inside Google, but it relies on each node exposing service information. Borgmon-aware services publish information over HTTP endpoints, even if the services they monitor aren’t themselves HTTP. We’ll see several options in this lesson on how to centralize and visualize the system itself, beyond the node-by-node endpoints provided by Actuator.

In this section we’ll look at a few handy tools that support the ever-important dashboard experience that both operations and business will appreciate. These dashboards often build on the tools we’ve looked at so far, presenting the relevant information in a single, at-a-glance experience. These tools rely on service registration and discovery to discover services in a system and then surface information about them. In this section, we rely on a Netflix Eureka registry being available so that our dashboards can discover and monitor the deployed services in our system. We might alternatively use Hashicorp Consul or Apache Zookeeper, or any other registry for which there’s a Spring Cloud DiscoveryClient abstraction implementation available.

Monitoring Downstream Services with the Hystrix Dashboard

We can’t add instrumentation to other teams’ code. We can’t insist that they build their applications using best-of-breed technologies like Cloud Foundry and Spring Cloud. We can’t make other teams, and other organizations, do anything, usually. The best we can do is protect ourselves from potential failures in downstream code. One way to do this is to wrap potentially shaky service-to-service calls with a circuit breaker.

Spring Cloud supports easy integration with the Netflix Hystrix circuit breaker. Let’s look at a trivial example that randomly inserts failure when issuing calls to either http://google.com or http://yahoo.com (Example 1-36).

Example 1-36. Use the Hystrix circuit breaker (assuming that you’ve specified @EnableCircuitBreaker somewhere)
package com.example;

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.net.URI;
import java.util.Random;

@RestController
class ShakyRestController {

 @Autowired
 private RestTemplate restTemplate;

 1
 public ResponseEntity<String> fallback() {
  return ResponseEntity.ok("ONOES");
 }

 2
 @HystrixCommand(fallbackMethod = "fallback")
 @RequestMapping(method = RequestMethod.GET, value = "/google")
 public ResponseEntity<String> google() {
  return this.proxy(URI.create("http://www.google.com/"));
 }

 @HystrixCommand(fallbackMethod = "fallback")
 @RequestMapping(method = RequestMethod.GET, value = "/yahoo")
 public ResponseEntity<String> yahoo() {
  return this.proxy(URI.create("http://www.yahoo.com"));
 }

 private ResponseEntity<String> proxy(URI url) {

  if (new Random().nextInt(100) > 50) {
   throw new RuntimeException("tripping circuit breaker!");
  }

  ResponseEntity<String> responseEntity = this.restTemplate.getForEntity(url,
   String.class);

  return ResponseEntity.ok()
   .contentType(responseEntity.getHeaders().getContentType())
   .body(responseEntity.getBody());
 }

}
1

Provide a fallback behavior that gets called when a circuit throws an exception. In this case we return a silly String.

2

We decorate our various REST calls with a circuit breaker so that downstream service calls that may fail are safely handled.

Each node that uses a Hystrix circuit breaker emits a server-sent event (SSE) heartbeat stream. The stream is accessible from http://localhost:8000/hystrix.stream, assuming the default configuration and that the above code is listening on port 8000. That stream is constantly updated. It contains information about the flow of traffic through the circuit, including how many requests were made, whether the circuit is open (in which case requests are failing and being diverted to the fallback handler) or closed (and thus requests that are attempted reach the downstream service), and statistics about the traffic itself. While we cannot instrument other people’s services, we can monitor the flow of requests through our circuit breakers as a sort of virtual monitor for downstream services. If the circuit breaker is open and requests aren’t going through, that probably indicates that the downstream service is down.
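
The open/closed behavior can be illustrated with a toy circuit breaker. The consecutive-failure threshold here is a simplified stand-in for Hystrix’s actual rolling-window algorithm:

```java
// A toy circuit breaker (simplified stand-in, not Hystrix's algorithm)
// illustrating the open/closed states: after enough consecutive failures
// the circuit opens and calls are diverted to a fallback.
import java.util.function.Supplier;

public class ToyCircuitBreaker {

    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private boolean open = false;

    ToyCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    String call(Supplier<String> downstream, Supplier<String> fallback) {
        if (open) {
            return fallback.get();        // open: divert without touching downstream
        }
        try {
            String result = downstream.get();
            consecutiveFailures = 0;      // closed and healthy: reset the count
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                open = true;              // trip the circuit
            }
            return fallback.get();
        }
    }

    boolean isOpen() { return open; }

    public static void main(String[] args) {
        ToyCircuitBreaker cb = new ToyCircuitBreaker(2);
        Supplier<String> failing = () -> { throw new RuntimeException("down"); };
        Supplier<String> fallback = () -> "ONOES";

        cb.call(failing, fallback);
        cb.call(failing, fallback);       // second failure trips the circuit
        System.out.println("open: " + cb.isOpen());
        System.out.println(cb.call(failing, fallback)); // served by the fallback
    }
}
```

A real Hystrix circuit also transitions to a half-open state and periodically retries the downstream service; this sketch omits that for brevity.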

We can monitor the circuit breakers. The circuit breaker stream is not much to look at directly, but you can take that stream and give it to another component, the Hystrix Dashboard, which visualizes the flow of requests through the circuits on a specific node.

Create a new Spring Boot application and add org.springframework.cloud : spring-cloud-starter-hystrix-dashboard to the classpath. Add @EnableHystrixDashboard to a @Configuration class and then start the application. The Hystrix Dashboard UI is available at /hystrix.html.

Not bad! But this is still just one node. That’s untenable at scale, where you have more than one instance of the same service, or multiple services besides. A single-node-aware Hystrix Dashboard won’t be very useful; you’d have to plug in every /hystrix.stream address one at a time. We can use Spring Cloud Turbine to multiplex all the streams from all the nodes into one stream, and then plug the resulting aggregate stream into our Hystrix Dashboard. Spring Cloud Turbine can aggregate services using service registration and discovery (via the Spring Cloud DiscoveryClient service registry abstraction) or through messaging brokers like RabbitMQ and Apache Kafka (via the Spring Cloud Stream messaging abstraction).

Add org.springframework.boot : spring-boot-starter-web, org.springframework.cloud : spring-cloud-starter-stream-rabbit, and org.springframework.cloud : spring-cloud-starter-turbine-stream to a new Spring Boot application (Example 1-37).

Example 1-37. Use Spring Cloud Turbine to aggregate the server-sent event heartbeat streams from multiple circuit breakers, across multiple nodes, into one stream
package com.example;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.turbine.stream.EnableTurbineStream;

1
@EnableTurbineStream
@SpringBootApplication
public class TurbineApplication {

 public static void main(String[] args) {
  SpringApplication.run(TurbineApplication.class, args);
 }
}
1

Stand up the Spring Cloud Stream-based Turbine aggregation stream.

When the Spring Cloud Turbine service starts, it will serve a stream at http://localhost:8989/hystrix.stream, where 8989 is the default port. You can override the port by specifying turbine.stream.port. We specified 8010 for this example.

All clients that have circuit breakers in them will need to be updated a bit to support the involvement of Spring Cloud Turbine. Add a Spring Cloud Stream binding (we’re using spring-cloud-starter-stream-rabbit), and then add org.springframework.cloud : spring-cloud-netflix-hystrix-stream. This last dependency adapts circuit breaker status updates to messages sent over your particular Spring Cloud Stream binder choice. As part of this, Spring Cloud Turbine will need information about the local node and the cluster. The easiest way to give it that information is to use service registration and discovery. Add a DiscoveryClient abstraction implementation (we’re using org.springframework.cloud : spring-cloud-starter-eureka) to the classpath, as well.

Tip

This will of course require that Netflix’s Eureka service registry be running somewhere as well. We’ll use a service registry a fair amount as we look at ways to observe systems, not just individual nodes. You may as well get one running and keep it running if you’re following along from here.

Restart your client and then revisit the circuit breaker dashboard. Plug in the hystrix.stream endpoint from the Spring Cloud Turbine (http://localhost:8010/hystrix.stream, if you’re using our code). Figure 1-12 shows the results of viewing the Hystrix stream.

Figure 1-12. The Hystrix Dashboard
Tip

Technologies like the Hystrix Dashboard are important, but they will add a cost to your operational overhead. Ideally, this competency should be managed by the platform, and automated. If you’re using Cloud Foundry, there is a Hystrix Dashboard backing service in the service catalog that is already wired to use Spring Cloud Turbine, ready to use.

Codecentric’s Spring Boot Admin

Spring Boot Admin is a project from the folks at Codecentric. It provides an aggregated view of services, and supports drilling down into a Spring Boot-based service’s Actuator-exposed endpoints (logs, JMX, environment, request logs, etc.).

To use it, you’ll need to stand up a service registry (we’re using the Netflix Eureka instance described earlier). Set up a new Spring Boot application and add the following dependencies to the classpath: de.codecentric : spring-boot-admin-server and de.codecentric : spring-boot-admin-server-ui. The version itself will vary, of course, so check the project’s Git repository or your favorite Maven repository. In this example, we’re using 1.5.0 (Example 1-38).
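As a sketch, the corresponding Maven dependencies (pinned at 1.5.0, the version used in this example) might look like this:

```xml
<!-- the Spring Boot Admin server itself -->
<dependency>
  <groupId>de.codecentric</groupId>
  <artifactId>spring-boot-admin-server</artifactId>
  <version>1.5.0</version>
</dependency>
<!-- the web UI for the Spring Boot Admin server -->
<dependency>
  <groupId>de.codecentric</groupId>
  <artifactId>spring-boot-admin-server-ui</artifactId>
  <version>1.5.0</version>
</dependency>
```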

Example 1-38. Standing up an instance of the Spring Boot Admin
package com.example;

import de.codecentric.boot.admin.config.EnableAdminServer;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

@EnableDiscoveryClient
@EnableAdminServer
@SpringBootApplication
public class SpringBootAdminApplication {

 public static void main(String[] args) {
  SpringApplication.run(SpringBootAdminApplication.class, args);
 }
}

We’ll connect the clients to the server with Spring Cloud’s DiscoveryClient support. Add org.springframework.cloud : spring-cloud-starter-eureka and the @EnableDiscoveryClient annotation to both the Spring Boot Admin service and all the clients. With service registration and discovery in place, our clients need not be explicitly aware of the Spring Boot Admin server. Alternatively, you may use the Spring Boot Admin client dependency, de.codecentric : spring-boot-admin-starter-client; in that case, you’ll need to specify a spring.boot.admin.url property, pointing the clients to the Spring Boot Admin server instance.
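If you opt for the client dependency instead of service discovery, the client configuration might look like the following sketch. The URL is an assumption based on the Spring Boot Admin server running locally on port 8080, as it does later in this example:

```properties
# point the client at the Spring Boot Admin server instance
spring.boot.admin.url=http://localhost:8080
```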

Your client will need the Spring Boot Actuator, as well. In Spring Boot 1.5 or greater, the Actuator endpoints are locked down and require authentication. The simplest way to get around that is to disable authentication (management.security.enabled = false) for the management endpoints themselves; otherwise some functionality in the Spring Boot Admin will not work.
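A minimal application.properties entry for the client, per the workaround just described:

```properties
# disable authentication on the Actuator management endpoints (Spring Boot 1.5+)
management.security.enabled=false
```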

Start your client and then visit the Spring Boot Admin (shown in Figure 1-13).

Figure 1-13. The central screen for the Spring Boot Admin that lists registered services

In our example, we run it on port 8080, as shown in Figures 1-14 and 1-15.

Figure 1-14. The Spring Boot Admin enumeration of requests (/trace)
Figure 1-15. The Spring Boot Admin details screen

The Spring Boot Admin gives yet another way to see the aggregation of services in a system and drill down into their state.

Ordina Microservices Dashboard

Ordina’s JWorks division created another dashboard that provides a very handy visual enumeration of the registered services in a system. It also discovers services through Spring Cloud’s DiscoveryClient support, so you’ll need the aforementioned service registry stood up and an implementation on the client classpath (Example 1-39).

Example 1-39. Standing up an instance of the Microservices Dashboard
package com.example;

import be.ordina.msdashboard.EnableMicroservicesDashboardServer;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

@SpringBootApplication
@EnableDiscoveryClient
@EnableMicroservicesDashboardServer
public class MicroservicesDashboardServerApplication {

 public static void main(String[] args) {
  SpringApplication.run(MicroservicesDashboardServerApplication.class, args);
 }
}

The Microservices Dashboard (Figure 1-16) provides a visualization of how services are connected together. It features four different lanes meant to reflect layers of components in a system:

  • UI components are just that: user interface components, such as Angular directives.

  • Resources might be information drawn from the Spring Boot Actuator /mappings endpoint (with the default Spring mappings excluded), or hypermedia links exposed on an index resource through an index controller.

  • Microservices are services discovered (Spring Boot or not) using Spring Cloud’s DiscoveryClient abstraction.

  • Backends are any HealthIndicators found on the discovered microservices—components that services depend upon that may fail.

Figure 1-16. The Microservices Dashboard enumerating the discovered services and providing information on interesting components in those services

The Microservices Dashboard supports drilling down by node states, name, types, and groups. You can add virtual nodes—nodes that aren’t automatically discovered but about which you’d like to make the Microservices Dashboard aware. If nothing else, they could be placeholders for things that should be there, eventually, for planning purposes.

Pivotal Cloud Foundry’s AppsManager

The two dashboards we just looked at rely on Spring Boot’s Actuator to surface information about the Java process. Ultimately, though, you’re going to run code on a platform, and in that platform there will be other moving parts, like the container (which in turn has its own health to be monitored) that runs the application process itself, backing services, and so on. If you’re using something like Cloud Foundry, there is no reason you couldn’t run .NET applications or Python applications or whatever other technologies you’d like, and they too will have their own status. Arguably, the optimal dashboard would centralize visibility into all applications in a system. If more detailed diagnostics, such as those in the Spring Boot Actuator, are available, then that level of detail should be visible, too. The Pivotal Cloud Foundry AppsManager gets this just right. We have touched on it already in this lesson, so we’ll only present an overview of the AppsManager.

We can review all the applications for a given user who belongs to a particular organization and a particular space—this is shown in Figure 1-17.

Figure 1-17. All the applications in a particular Cloud Foundry space, which is in turn part of an organization

We can review the details of an individual application as well, as shown in Figure 1-18.

Figure 1-18. The details of a particular Cloud Foundry application. Information from the Spring Boot Actuator is visible as well.

Remediation

Thus far we’ve focused on surfacing information about the state of the system, on improving visibility. What do we do with this knowledge? In a static, pre-cloud environment, improved visibility can be used to trigger alerting, which then (hopefully) results in a pager going off or somebody getting an email. By the time alerting has happened, it’s probably too late and somebody’s already had a bad experience on the system. Cloud computing (which supports manipulation with APIs) changes this dynamic: we can solve software problems with software. If, for example, the system needs more capacity, we don’t need to file an ITIL ticket; just make an API request. We can support automatic remediation.

Distributed computing changes the way we should think about application instance availability. In a sufficiently distributed system, keeping a single instance available 100% of the time becomes nearly impossible. Instead, the focus must be on building a system where service is somehow restored; we must optimize time to remediation. If time to remediation is zero seconds, then we are effectively 100% highly available, but this change in approach has profound implications for how we architect our system. We can only achieve this effect if we can program the platform itself.

The platform can even do basic remediation for you, automatically. Most cloud platforms provide health monitoring. If you ask Cloud Foundry to stand up 10 instances of an application, it’ll ensure that there are at least 10 instances. Even if an instance should die, Cloud Foundry will restart it.

Most cloud platforms, including Pivotal Web Services and Pivotal Cloud Foundry, support autoscaling. An autoscaler monitors container-level information like RAM and CPU and, if necessary, adds capacity by launching new application instances. On Pivotal Web Services, you need only create a service instance of the type app-autoscaler and then bind it to the application. You’ll need to configure it in the management panel on the Pivotal Web Services Console.

Example 1-40. Creating an autoscaler service on Pivotal Web Services
cf marketplace
Getting services from marketplace in org platform-eng / space joshlong as ..
OK

service         plans         description
..
app-autoscaler  bronze, gold  Scales bound applications in response to load (beta)
..
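Given the marketplace listing above, creating an autoscaler service instance and binding it to an application might look like the following. The service instance name (my-autoscaler) and application name (my-app) are hypothetical placeholders:

```shell
# create an autoscaler service instance on the bronze plan
cf create-service app-autoscaler bronze my-autoscaler

# bind it to the application to be scaled
cf bind-service my-app my-autoscaler
```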

There is still room for yet more advanced remediation. There are some inspiring examples out there, like Netflix’s Winston, LinkedIn’s Nurse, Facebook’s FBAR, and the open source StackStorm. These tools make it easy to define pipelines composed of atoms of functionality that, when combined, solve a problem for you. These tools work in terms of well-known input events, triggered by monitoring agents or sensors or other indicators deployed in the system. In a traditional architecture these events would trigger alerting, which is useful, but for some classifications of problems it’s also possible to trigger automatic remediation flows.

We encourage you to investigate some of these approaches. Most of them are rooted in environments that don’t have the benefits of the layers of guarantees made by a platform like Cloud Foundry, though. In our case, we don’t need to solve problems like restarting services or load balancing. We also don’t need to worry about alerting for low-level things like heartbeat detection; the platform will manage that for us.

There are still some gaps in our visibility, though—things that we alone know to look for because we know the specifics of our architecture. It’s not hard to find the components in our architecture that tell a story about our system’s capacity. These components expose information that we can monitor, or react to. Spring Integration makes it simple to process events from different event sources and to string together components over messaging technologies like RabbitMQ or Apache Kafka. Spring Cloud Data Flow builds upon Spring Integration, providing a Bash shell-like DSL that lets you compose arbitrary streams of processing and then orchestrate them on a distributed processing fabric like YARN or Cloud Foundry. Spring Cloud Data Flow is an ideal toolbox to assemble processing pipelines that respond to insight coming from our own custom event sources.

There are a lot of things you might look at as prompts for action. We are in an enviable position—all the knobs and levers we need to turn or pull to solve many classes of problems have APIs. We can use software to fix faltering software. Let’s suppose we have a consumer that takes a nontrivial amount of time to respond to messages. We can look at the throughput of a queue in your message broker to decide that we should add more consumers to help more efficiently drain the queue and keep within a service-level agreement (SLA).
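The decision logic behind such a policy can be quite simple. Here is a minimal, self-contained sketch in plain Java; the class, method, and parameter names are hypothetical, but the thresholds mirror the instanceCountMaximum and thresholdMaximum parameters we’ll pass to the autoscaler sink shortly:

```java
public class AutoscalerPolicy {

 // Hypothetical policy: add an instance while the queue depth exceeds the
 // threshold (up to a maximum instance count), remove one when the queue
 // has drained, and otherwise leave the instance count alone.
 public static int desiredInstances(int queueDepth, int thresholdMaximum,
   int currentInstances, int instanceCountMaximum) {
  if (queueDepth > thresholdMaximum && currentInstances < instanceCountMaximum) {
   return currentInstances + 1;
  }
  if (queueDepth == 0 && currentInstances > 1) {
   return currentInstances - 1;
  }
  return currentInstances;
 }

 public static void main(String[] args) {
  // a backlog of 8 messages against a threshold of 5: add a fourth instance
  System.out.println(desiredInstances(8, 5, 3, 10)); // prints 4
 }
}
```

A real autoscaler would also dampen oscillation (for example, by requiring the depth to stay above the threshold for several observations before scaling), but the shape of the decision is the same.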

In the source code for this lesson there are a few handy Spring Cloud Data Flow sources (a component that produces events as messages) and sinks (a component that does something in response to events). The rabbit-queue-metrics source monitors a given RabbitMQ queue and publishes information about it: the queue depth (how many messages haven’t been processed), the consumer count (how many clients are listening for messages on the other end), and the queue name itself. The cloudfoundry-autoscaler sink responds to incoming messages (which should be a number) and adds or reduces instances for a given application until that number falls within a prescribed range. The range and relevance of the number is for you to ascribe. It’s trivial, once you’ve registered the custom sources and sinks, to connect these two things by taking the output of the source, extracting the queue size from the information, and then sending it to the autoscaler sink.

Example 1-41. Using Spring Cloud Data Flow to automatically scale applications based on queue depth
rabbit-queue-metrics
  --rabbitmq.metrics.queueName=remediation-demo.remediation-demo-group |
 transform --expression=headers['queue-size'] |
 cloudfoundry-autoscaler --applicationName=remediation-rabbitmq-consumer
   --instanceCountMaximum=10  --thresholdMaximum=5

There are a lot of input variables that go into understanding how to support and fix a broken system. If you capture and distill those events, and connect them to automatic handlers, you have the basis for an automated incident response system for certain simple classes of problems. Spring Cloud Data Flow is purpose-built for this sort of ad hoc event-monitoring and response-based approach. In the source code for this lesson we’ve also included a Spring Cloud Data Flow source component to monitor Cloud Foundry application metrics like RAM and hard disk usage. You could build a similar remediation flow based on an application’s other metrics.

Summary

We’ve only begun to look at the possibilities in this lesson; if you’re feeling overwhelmed, good! This subject is of critical importance in a cloud native system, and failure to architect with observability in mind only tempts disaster. These are often called “day two problems”—things you won’t realize you need until after you get to production. It is dramatically less painful on “day two” if you think about these things on “day one.” Spring Boot, Spring Cloud, and Cloud Foundry are purpose-built to quickly and easily integrate and support these requirements with a minimum of ceremony.

You’ll note that, while a lot of what we talked about in this lesson was introduced in terms of open source clients, some of the backing services we mentioned are hosted SaaS offerings, and they typically cost money. This is a feature; as explained earlier, our goal is to never run software we can’t sell. It’s better to let the platform, and third parties for whom these concerns are core competencies, satisfy these nonfunctional requirements instead.

Consider extracting all these production-centric requirements into a separate Spring Boot auto-configuration on which microservices in your organization build. You can even create your own starter dependencies and meta-annotations. This way, you need only configure how you handle logs, metrics, tracing, etc., once, and then simply add the appropriate auto-configuration to the classpath. If you are using a platform like Cloud Foundry, it is trivial to stand up the relevant backing services to support your application. Taken together, an application framework like Spring Boot (which supports twelve-factor applications) and a platform like Cloud Foundry (which supports twelve-factor operations) help you get to production, safely and quickly.
