2 A Monitoring and Measurement Framework

Over the next few chapters we’ll build our monitoring framework. We’ll start by looking at data collection, metrics, aggregation, and visualization. Then we’ll expand the framework to collect application and business metrics, culminating in a capstone chapter where we’ll put everything together. We’ll build a framework that focuses on events and metrics and collects data in a scalable and robust way.

In our new monitoring paradigm, events and metrics are going to be at the core of our solution. This data will provide the source of truth for:

  • The state of our environment
  • The performance of our environment

Visualization of this data will also allow for the ready expression and interpretation of complex ideas that would otherwise take thousands of words or hours of explanation.

In this chapter we’re going to step through our proposed monitoring framework. We’ll introduce the basic concepts and lay the groundwork that will help you understand the choice of tools and techniques we’ve made later in the book.

To implement our monitoring framework we’re proposing a new architecture.

Monitoring framework

Our new architecture is going to:

  • Allow us to easily visualize the state of our environment.
  • Be event, log, and metrics-centric.
  • Focus on “whitebox” or push-based monitoring instead of “blackbox” or pull-based monitoring.
  • Provide contextual and useful notifications.

These objectives will allow us to take our Reactive Example.com environment closer to the Proactive model, and to ensure we monitor the right components in the right way.

Let’s examine our new architecture.

2.1 Pull versus Push

We’re going to fundamentally change the architecture of how we perform monitoring. Most monitoring systems are pull/polling-based systems. An excellent example is Nagios. With Nagios, your monitoring system generally queries the components being monitored; a classic check might be an ICMP-based ping of a host. This means that the more hosts and services you manage in your environment, the more checks your Nagios host needs to execute and process. We then need to scale our monitoring vertically, or partition it, to address growth.

We’re going to, wherever possible, avoid pull-based monitoring in favor of push-based monitoring. With a push-based architecture, hosts, services, and applications are emitters, sending data to a central collector. The collection is fully distributed on the hosts, services, and applications that emit data, resulting in linear scalability. This means monitoring is no longer a monolithic central function, and we don’t need to vertically scale or partition that monolith as more checks are added.

Emitters report when they are available. Generally emitters are stateless, sending data as soon as it is generated. They can use transports and mechanisms local and appropriate to themselves, rather than being forced into a choice by your monitoring tools. This enables us to build modular, functionally separated, compartmentalized monitoring solutions with selected best-of-breed tools rather than monolithic silos.

Pull-based approaches also require central configuration of what to monitor and where to find it. With push, your emitters (hosts, services, and applications) send data when they start, and push metrics to destinations you’ve configured locally. This is especially important in dynamic environments, where a short-lived activity might not exist long enough to be discovered or converged into configuration by a pull-based monitor. With a push-based architecture this isn’t an issue, because the emitter controls when and where the data is sent.
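To make the push model concrete, here is a minimal sketch of an emitter, assuming a StatsD-style collector listening on UDP. The collector address and metric names are illustrative assumptions, not part of our framework.

import socket

COLLECTOR = ("metrics.example.com", 8125)  # assumed collector address

def emit(metric, value, metric_type="c"):
    # Format a StatsD-style line, e.g. "webapp.logins:1|c", and push it
    # over UDP. The emitter needs no central configuration and does not
    # listen for inbound connections.
    line = f"{metric}:{value}|{metric_type}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), COLLECTOR)

# Emit a counter increment and a gauge reading as soon as the data exists.
emit("webapp.logins", 1)
emit("webapp.queue_depth", 42, metric_type="g")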

We also get a broad security dividend from a push-based architecture: emitters are inherently more secure against remote attacks since they do not listen for network connections. This decreases the attack surface of our hosts, services, and applications. Additionally, this reduces the operational complexity of any security model, as networks and firewalls only need to be configured for unidirectional communication from emitters to collector.

Polling-based systems also generally emphasize monitoring availability—“Is it up?”—and the minimization of downtime. Where we do use polling systems we’ll limit their focus to this sort of coarse-grained availability monitoring. Polling-based systems also provide a strong focus on small, atomic actions—for example, telling you that an Nginx daemon has stopped working. This can be hugely attractive, because fixing those atomic issues is often much easier and simpler than addressing more systemic problems, such as a 10% site-wide increase in HTTP 500 errors.

You may be thinking, “Hey, what’s wrong with that?” Well, there’s nothing fundamentally wrong with it, except that it reinforces the view that IT is a cost center. Orienting your focus toward availability, rather than quality and service, treats IT assets as pure capital and operational expenditure. They aren’t assets that deliver value; they are just assets that need to be managed. Organizations that view IT as a cost center tend to be happy to limit or cut budgets, outsource services, and not invest in new programs, because they only see cost and not value.

Thankfully, IT organizations have started to be viewed in a more positive light. Organizations have recognized not only that it’s impossible to do business without high-quality IT services, but that those services are actually market differentiators. If you do IT better than your competitors then this is a marketable asset. Adding to this is the popularity and flexibility of virtualization, elastic computing like the cloud, and the introduction of Software-as-a-Service (SaaS). The perception of IT has now started to move from a cost center to, if not an actual revenue center, then at least a lever for increasing revenue. This change has consequences though, the most important being that we now need to measure the quality and performance of IT, not just its availability. This data is crucial to both the business and the technology organization making good decisions.

Push-based models also tend to be more focused on measurement. You still get availability measurement, but as a side effect of measuring components and services. As collection is distributed and generally low overhead, you can also push a lot of data and store it at high precision. This increased precision of data can then be used to more quickly answer questions about quality of service, performance, and availability, and to power decisions around spending, headcount, and new programs. This changes the focus inside your IT organization towards measuring value, throughput, and performance—all levers that are about revenue rather than cost.

2.2 Blackbox and Whitebox

Our architecture will also focus on whitebox monitoring more than blackbox monitoring. Blackbox monitoring probes the outside of a service or application; a lot of pull-based systems use this as their basic building block. You query the external characteristics of a service: does it respond to a poll on an open port, and does it return the correct data or response code? An example of blackbox monitoring is performing an HTTP check and confirming you have received a 200 OK response code.
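As a simple illustration, a blackbox HTTP check might look like the following sketch. The URL is an assumption, and the check only sees the external response, not what is happening inside the application.

from urllib.request import urlopen

def blackbox_http_check(url="https://www.example.com/", timeout=5):
    # Probe the service from the outside: all we learn is whether it
    # answered within the timeout and returned a 200 OK.
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:   # covers URLError, timeouts, and connection failures
        return False

print("UP" if blackbox_http_check() else "DOWN")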

Whitebox monitoring instead focuses on what’s inside the service or application. The application is instrumented and returns its state, the state of internal components, or the performance of transactions or events. Most whitebox monitoring is done either by emitting events, logs, and metrics to a monitoring tool (the approach we’ve detailed in our push-based model), or by exposing this information on a status page of some kind, which a pull-based system would query.

The whitebox approach provides an idea of the actual running state of your service or application. It allows you to communicate a much richer, more contextual, set of information about the state of your application than blackbox monitoring. It provides a better approach to exposing the information both you and the business require to monitor your application.
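By contrast, a whitebox approach instruments the application itself. The sketch below, with hypothetical counter names, keeps internal counters and exposes them as a JSON status document that could either be pushed to a collector or served on a status page.

import json
import time

# Internal application state, updated by the application as it runs.
counters = {"requests_total": 0, "errors_total": 0}
timings_ms = []

def record_request(duration_ms, error=False):
    counters["requests_total"] += 1
    if error:
        counters["errors_total"] += 1
    timings_ms.append(duration_ms)

def status_document():
    # A snapshot of internal state: much richer than "port 80 answered".
    return json.dumps({
        "timestamp": time.time(),
        "counters": counters,
        "avg_request_ms": sum(timings_ms) / len(timings_ms) if timings_ms else None,
    })

record_request(12.5)
record_request(48.0, error=True)
print(status_document())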

This is not to say blackbox monitoring has no place. It is often useful to know the state of a service’s external characteristics, especially if the service is provided by a third party and you don’t have insight into its internal operations. It is also often useful to view your service or application from the outside to understand certain types of networking, security, or availability issues. We’ll use blackbox monitoring where appropriate in the book.

2.3 Event, log, and metric-centered

Our new push and whitebox-centric architecture is going to be centered around collecting event and metric data. We’ll use that data to monitor our environment and detect when things go wrong.

  • Events — We’ll generally use events to let us know about changes and occurrences in our environment.
  • Logs — Logs are a subset of events. While they’re helpful for letting us know what’s happening, they’re often most useful for fault diagnosis and investigation.
  • Metrics — Of all these data sources, we’ll rely most heavily on metrics to help us understand what’s going on in our environment. Let’s take a deeper look at metrics.

2.3.1 More about metrics

Metrics always appear to be the most straightforward part of any monitoring architecture. As a result we sometimes don’t invest quite enough time in understanding what we’re collecting, why we’re collecting it, and what we’re doing with our metrics.

Indeed, in a lot of monitoring frameworks, the focus is on fault detection: detecting if a specific system event or state has occurred (more on this below). When we receive a notification about a specific system event, usually we go look at whatever metrics we’re collecting, if any, to find out what exactly has happened and why. In this world, metrics are seen as a by-product of or a supplement to our fault detection.

Tip See the discussion later in this chapter about notification design for further reasons why this is a challenging problem.

We’re going to change this idea of metrics-as-supplement. Metrics are going to be the most important part of your monitoring workflow. We’re going to turn the fault-detection-centric model on its head. Metrics will provide the state and availability of your environment and its performance.

Our framework avoids duplicating Boolean status checks when a metric can provide information on both state and performance. Harnessed correctly, metrics provide a dynamic, real-time picture of the state of your infrastructure that will help you manage and make good decisions about your environment.

Additionally, through anomaly detection and pattern analysis, metrics have the potential to identify faults or issues before they occur or before the specific system event that indicates an outage is generated.

2.3.2 So what’s a metric?

As metrics and measurement are so critical to our monitoring framework, we’re going to help you understand what metrics are and how to work with them. This is intended to be a simplified background that will allow you to understand what different types of metrics, data, and visualizations will contribute to our monitoring framework.

Metrics are measures of properties in pieces of software or hardware. To make a metric useful we keep track of its state, generally recording data points or observations over time. An observation is a value, a timestamp, and sometimes a series of properties that describe the observation, such as a source or tags. The combination of these data point observations is called a time series.

A classic example of a metric we might collect as a time series is website visits, or hits. We periodically collect observations about our website hits, recording the number of hits and the times of the observations. We might also collect properties such as the source of a hit, which server was hit, or a variety of other information.
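A minimal way to model this, assuming nothing more than the definitions above, is an observation made up of a value, a timestamp, and some describing properties, with a time series simply being an ordered list of observations.

import time
from dataclasses import dataclass, field

@dataclass
class Observation:
    value: float                     # the measurement, e.g. number of hits
    timestamp: float                 # when it was observed (Unix time)
    properties: dict = field(default_factory=dict)  # e.g. source, tags

# A time series is a chronologically ordered list of observations.
website_hits = [
    Observation(102, time.time() - 120, {"server": "web01"}),
    Observation(117, time.time() - 60,  {"server": "web01"}),
    Observation(98,  time.time(),       {"server": "web01"}),
]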

We generally collect observations at a fixed time interval—we call this the granularity or resolution. This could range from one second to five minutes to 60 minutes or more. Choosing the right granularity at which to record a metric is critical. Choose too coarse a granularity and you can easily miss the detail. For example, sampling CPU or memory usage at five-minute intervals is highly unlikely to identify anomalies in your data. Alternatively, choose too fine a granularity and you can end up needing to store and interpret large amounts of data. We’ll talk more about this in Chapter 4.

So, a time-series metric is generally a chronologically ordered list of these observations. Time-series metrics are often visualized, sometimes with a mathematical function applied, as a two-dimensional plot with data values on the y-axis and time on the x-axis. Often you’ll see multiple data values plotted on the y-axis—for example, the CPU usage values from multiple hosts or successful and unsuccessful transactions.

A sample plot

These plots can be incredibly useful to us. They provide us with a visual representation of critical data that is (relatively) easy to interpret, certainly with more facility than perusing the same data in the form of a list of values. They also present us with a historical view of whatever we’re monitoring—they show us what has changed and when. We can use both of these capabilities to understand what’s happening in our environment and when it happened.

2.3.3 Types of metrics

There are a variety of different types of metrics we’ll see in our environment.

2.3.3.1 Gauges

The first, and most common, type of metric we’ll see is a gauge. Gauges are numbers that are expected to change over time. A gauge is essentially a snapshot of a specific measurement. The classic metrics of CPU, memory, and disk usage are usually articulated as gauges. For business metrics, a gauge might be the number of customers present on a site.

A sample gauge
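As a minimal sketch, here is how we might collect a couple of gauges on a Unix-like host using only the standard library; the metric names are illustrative.

import os
import shutil
import time

def collect_gauges():
    # Each gauge is a point-in-time snapshot of a measurement.
    total, used, free = shutil.disk_usage("/")
    return {
        "timestamp": time.time(),
        "load.one_minute": os.getloadavg()[0],     # Unix-like systems only
        "disk.root.percent_used": used / total * 100,
    }

print(collect_gauges())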

2.3.3.2 Counters

The second type of metric we’ll see frequently is a counter. Counters are numbers that increase over time and never decrease. Although they never decrease, counters can sometimes reset to zero and start incrementing again. Good examples of application and infrastructure counters are system uptime, the number of bytes sent and received by a device, or the number of logins. Examples of business counters might be the number of sales in a month or cost of sales for a time period.

A sample counter

In this figure we have a counter incrementing over a period of time.

A useful thing about counters is that they let you calculate rates of change. Each observed value is a moment in time: t. You can subtract the value at t from the value at t+1, and divide by the interval between them, to get the rate of change between the two values. A lot of useful information can be gleaned from the rate of change between two values. For example, the number of logins is marginally interesting, but create a rate from it and you can see the number of logins per second, which should help identify periods of site popularity.
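A sketch of deriving a rate from two counter observations; it also guards against the counter reset mentioned above.

def counter_rate(value_t, value_t1, seconds_between):
    # Rate of change between two counter observations. If the counter has
    # reset (the later value is smaller), we can't know the true delta for
    # the interval, so report no rate rather than a negative one.
    if value_t1 < value_t:
        return None
    return (value_t1 - value_t) / seconds_between

# 1,200 logins at t, 1,500 logins 60 seconds later: 5 logins per second.
print(counter_rate(1200, 1500, 60))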

2.3.3.3 Timers

We’ll also see a small selection of timers. Timers track how long something took. They are commonly used for application monitoring—for example, you might embed a timer at the start of a specific method and stop it at the end of the method. Each invocation of the method would result in the measurement of the time the method took to execute.

A sample timer

Here we have a timer measuring the average time in milliseconds of the execution of a payment method.
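A minimal timing sketch using a context manager; the payment method here is a hypothetical stand-in, not part of our framework.

import time
from contextlib import contextmanager

timings_ms = []

@contextmanager
def timer():
    # Record how long the wrapped block took, in milliseconds.
    start = time.monotonic()
    try:
        yield
    finally:
        timings_ms.append((time.monotonic() - start) * 1000)

def process_payment():          # hypothetical method we want to time
    time.sleep(0.05)

with timer():
    process_payment()

print(f"last invocation took {timings_ms[-1]:.1f} ms")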

2.3.4 Metric summaries

Often the value of a single metric isn’t useful to us. Instead, making use of a metric often requires applying mathematical transformations to it. For example, we might apply statistical functions to our metric or to groups of metrics. Some common functions we might apply include:

  • Count or n — We count the number of observations in a specific time interval.

  • Sum — We sum (add together) values from all observations in a specific time interval.

  • Average — The mean of all values in a specific time interval.

  • Median — The median is the dead center of our values: exactly 50% of values are below it, and 50% are above it.

  • Percentiles — Measure the values below which a given percentage of observations in a group of observations fall.

  • Standard deviation — Standard deviation from the mean in the distribution of our metrics. This measures the variation in a data set. A standard deviation of 0 means all values are equal to the mean. Higher deviations mean the data is spread out over a wider range of values.

  • Rates of change — Rates of change representations show the degree of change between data in a time series.

  • Frequency distribution and histograms — A frequency distribution of a data set. You group data together—a process called “binning”—and present the groups in such a way that their relative sizes are visualized. The most common visualization of a frequency distribution is a histogram.

A histogram example

Here we see a sample histogram for the frequency distribution of heights. On the y-axis we have the frequency and on the x-axis we have the distribution of heights. We see that for heights between 160 and 165 cm there is a frequency of two.

Tip This is a brief introduction to these summary methods. We’ll talk about some of them in more detail later in the book.
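As a rough illustration of these summary methods, the following sketch applies them to a small set of observed values using Python’s standard statistics module; the binning width is an arbitrary assumption.

import statistics

values = [12, 22, 15, 3, 7, 94, 39]

print("count:", len(values))
print("sum:", sum(values))
print("average:", statistics.mean(values))
print("median:", statistics.median(values))
print("standard deviation:", statistics.stdev(values))

# A simple frequency distribution ("binning") for a histogram,
# grouping values into bins 20 units wide.
bins = {}
for v in values:
    bin_start = (v // 20) * 20
    bins[bin_start] = bins.get(bin_start, 0) + 1
print("histogram bins:", dict(sorted(bins.items())))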

2.3.5 Metric aggregation

In addition to summaries of specific metrics, you often want to show aggregated views of metrics from multiple sources, such as disk space usage of all your application servers. The most typical example of this results in multiple metrics being displayed on a single plot. This is useful in identifying broad trends over your environment. For example, an intermittent fault in a load balancer might result in web traffic dropping off for multiple servers. This is often easier to see in aggregate than by reviewing each individual metric.

An aggregated collection of metrics

In this plot we see disk usage from numerous hosts over a 30-day period. It gives us a quick way to ascertain the current state (and rate of change) of a group of hosts.

Ultimately you’ll find a combination of single and aggregate metrics—the former to drill down into specific issues, the latter to see the high-level state—provide the most representative view of the health of your environment.

2.4 Contextual and useful notifications

Notifications are the primary output from our monitoring architecture. They can consist of emails, instant messages, SMS messages, pop-ups, or anything else used to let you know about things in your environment that you need to be aware of. This seems like it should be a really simple domain but it contains a lot of complexity and is frequently poorly implemented and managed.

To build a good notification system you need to consider the basics of:

  • Who to tell about a problem.
  • How to tell them.
  • How often to tell them.
  • When to stop telling them, do something else, or escalate to someone else.

If you get it wrong and generate too many notifications then people will be unable to take action on them all and will generally mute them. We all have war stories of mailbox folders full of thousands of notification emails from monitoring systems. Sometimes so many notifications are generated that you suffer from “alert fatigue” and ignore them (or worse, conduct notification management via Select All -> Delete). Consequently you’re likely to miss actual critical notifications when they are sent.

Most importantly, you need to work out WHAT to tell whoever is receiving the notifications. Notifications are usually the sole signal that you receive to tell you that something is amiss or requires your attention. They need to be concise, articulate, accurate, digestible, and actionable. Designing your notifications to actually be useful is critical. Let’s make a brief digression and see why this matters. We’ll look at a typical Nagios notification for disk space.

PROBLEM Host: server.example.com
Service: Disk Space

State is now: WARNING for 0d 0h 2m 4s (was: WARNING) after 3/3 checks

Notification sent at: Thu Aug 7th 03:36:42 UTC 2015 (notification number 1)

Additional info:
DISK WARNING - free space: /data 678912 MB (9% inode=99%)

Now imagine you’ve just received this notification at 3:36 a.m. What does it tell you? That we have a host with a disk space warning. That the /data volume is 91% full. This seems useful at first glance but in reality it’s really not that practical. Firstly, is this a sudden increase? Or has this grown gradually? What’s the rate of expansion? For example, 9% disk space free on a 1GB partition is different from 9% disk free on a 1TB disk. Can I ignore or mute this notification or do I need to act now? Without this additional context my ability to take action on this notification is limited and I need to invest considerably more time to gather context.

In our framework we’re going to focus on:

  • Making notifications actionable, clear, and articulate. Just the use of notifications written by humans rather than computers can make a significant difference in the clarity and utility of those notifications.
  • Adding context to notifications. We’re going to send notifications that contain additional information about the component we’re notifying on.
  • Aligning our notifications with the business needs of the service being monitored so we only notify on what’s useful to the business.

Tip The simplest advice we can give here is to remember notifications are read by humans, not computers. Design them accordingly.
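As a sketch of what a more contextual notification might contain, consider the following; the fields, growth figures, and graph URL are illustrative assumptions, not the exact format we’ll build in Chapter 10.

def build_notification(host, volume, percent_used, growth_per_hour, graph_url):
    # A human-readable notification with enough context to act on:
    # current state, rate of change, time until full, and a link to the data.
    hours_left = (100 - percent_used) / growth_per_hour if growth_per_hour > 0 else None
    eta = f"~{hours_left:.0f} hours until full" if hours_left else "usage is not currently growing"
    return (
        f"Disk space warning on {host}\n"
        f"{volume} is {percent_used:.0f}% used, growing {growth_per_hour:.1f}% per hour ({eta}).\n"
        f"Recent trend: {graph_url}"
    )

print(build_notification("server.example.com", "/data", 91, 0.8,
                         "https://graphs.example.com/server.example.com/disk"))  # hypothetical URL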

In Chapter 10 we’ll build notifications with greater context and add a notification system to the monitoring framework we’re building.

2.5 Visualization

Visualizing data is both an incredibly powerful analytic and interpretive technique and an amazing learning tool. Throughout the book we’ll look at ways to visualize the data and metrics we’ve collected. However, metrics and their visualizations are often tricky to interpret. Humans tend towards apophenia—the perception of meaningful patterns within random data—when viewing visualizations. This often leads to sudden leaps from correlation to causation. This can be further exacerbated by the granularity and resolution of our available data, how we choose to represent it, and the scale on which we represent it.

Our ideal visualizations will clearly show the data, with an emphasis on highlighting substance over visuals. In this book we’ve tried to build visuals that subscribe to these broad rules:

  • Clearly show the data.
  • Induce the viewer to think about the substance, not the visuals.
  • Avoid distorting the data.
  • Make large data sets coherent.
  • Allow changing perspectives of granularity, without impacting comprehension.

We’ve drawn most of our ideas from Edward Tufte’s The Visual Display of Quantitative Information and thoroughly recommend reading it to help you build good visualizations.

There’s also a great post from the Datadog team on visualizing time-series data that is worth reading.

2.6 So why this architecture? What’s wrong with traditional monitoring?

In Chapter 1 we talked broadly about the problem space, how IT has changed, and why traditional monitoring fails to address that change. Let’s look more deeply into what’s broken and why this new architecture addresses those gaps.

When we describe “traditional monitoring,” especially in Reactive environments, what we’re usually talking about is fault detection. It is best articulated as watching an object so we know it’s working. Traditional monitoring is heavily focused on this active polling of objects to return their state—for example, an ICMP ping-based host availability check.

Historically, fault detection checks have relied on Boolean decisions that indicate whether something responds, or on whether a value falls within a range. Check selection and implementation are also often simplistic and may be:

  • Experience or learning-based. You implement the same checks you’ve used in the past, or checks cargo-culted from sources like documentation, example configurations, or blog posts.
  • Reactive. You implement a check or checks in response to an incident or outage that has occurred in the past.

Boolean check design and experience-based and Reactive checks have some major design flaws. Let’s examine why they are issues.

2.6.1 Static configuration

The checks generally have static configuration. Your check probably needs to be updated every time your system grows, evolves, or changes. In virtual and cloud environments, a host or service being monitored may be highly ephemeral: appearing, disappearing, or migrating locations or hosts multiple times during its lifespan. Statically defined checks just don’t handle this changing landscape, resulting in checks (and faults) on resources that do not exist or that have changed.

Further, many monitoring systems require you to duplicate configuration on both a server and the object being monitored. This lack of a single source of truth leads to increased risk of inconsistency and difficulty in managing checks. It also generally means that the monitoring server needs to know about resources being monitored before they can be monitored. This is clearly problematic in dynamic or changing landscapes.

Additionally, updates to monitoring are often considered secondary to scaling or evolving the systems themselves. Many faults are thus the result of incorrect configuration or orphaned checks. These false positives take time and effort to diagnose and resolve. They clutter your monitoring environment, hiding actual issues and concerns. Many teams do not realize they can change or delete the existing checks to remove these false positives—they take the monitoring as gospel.

2.6.2 Inflexible logic and thresholds

The checks often rely on inflexible Boolean logic or on arbitrary, static point-in-time thresholds. They generally rely on a specific result or range being matched. The checks again don’t consider the dynamism of most complex systems. A match or a breach in a threshold may be important, or it could have been triggered by an exceptional event—or it could even be a natural consequence of growth.

It’s our view that arbitrary static thresholds are always wrong. Database performance analysis vendor VividCortex’s CEO Baron Schwartz put it well:

They’re worse than a broken clock, which is at least right twice a day. A threshold is wrong for any given system, because all systems are slightly different, and it’s wrong for any given moment during the day, because systems experience constantly changing load and other circumstances.

Arbitrary static thresholds set up a point-in-time boundary. During that period everything beneath that boundary is judged normal and everything above is abnormal. That boundary is not only inflexible but it’s entirely artificial. One system’s abnormality may be another’s normal operation. That means notifications wired to arbitrary thresholds will frequently fire off false positives, unrelated to any actual problems.

Boolean checks suffer similar issues to arbitrary thresholds. They are usually singletons, and often can’t take advantage of trends or prior event history. Is this really a failure? Is it a critical failure? Is it merely flapping? Could one or more failures of this check (or even across a series of checks) actually be survivable, especially in the context of a resilient and well-architected application?

2.6.3 Object-centric

The checks are object-centric, usually centric to single hosts or services. They require you to define a check on an object or objects. This breaks down quickly because each of those objects is usually part of a much larger, often much more complex system. These single object checks frequently lack any context and limit your ability to understand what the check’s output means to the broader system. As a consequence it is often hard to determine the criticality of the object’s failure.

Of course, some monitoring systems do attempt to provide contextual layers above object checks, usually via grouping, but rarely manage to model beyond basic constructs. They also lack the ability to process the dynamism in most modern environments.

2.6.4 An interlude into pets and cattle

By the end of this book a lot of folks are probably going to be surprised by how few fault detection checks we actually build. Traditional monitoring environments are often marked by thousands of checks. So why aren’t we going to replicate those environments? Well, as we discovered earlier, those sorts of environments aren’t easy to manage, scale, or massively duplicate, and often aren’t actually helpful in doing fault diagnosis. There’s another factor, though, related to how the process of fault resolution is changing.

Bill Baker, a former Distinguished Engineer at Microsoft, once quipped that hosts are either pets or cattle. Pets have sweet names like Fido and Boots. They are lovingly raised and looked after. If something goes wrong with them you take them to the vet and nurse them back to health. Cattle have numbers. They are raised in herds and are basically identical. If something goes wrong with one of them, you put it down and replace it with another.

In the past hosts were pets. If they broke you fixed them, often nursing a host (named for a Simpsons character) back to life multiple times, tweaking configuration, fiddling with settings, and generally investing time in resolving the issue.

In modern environments, hosts are cattle. They should be configured automatically and rebuilt automatically. If a server fails then you kill it and restart another, automatically building it back to a functioning state. Or if you need more capacity you can add additional hosts. In these environments you don’t need hundreds or thousands of checks on individual components because the default fix for significant numbers of those components is to rebuild the host or scale the service.

2.6.5 So what do we do differently?

We’ve identified issues with traditional fault detection checks, and we’re advocating replacing these traditional status checks with events and metrics, but what does that mean? Rather than infrastructure-centric checks like pinging a host to return its availability or monitoring a process to confirm if a service is running, we configure our hosts, services, and applications to emit events and metrics. We get two benefits from events and metrics. Firstly:

If a metric is measuring, an event is reporting, or a log is spooling, then the service is available. If it stops measuring or reporting then it’s likely the service is not available.

Note What do we mean by available? The definition, for the purposes of this book, is that a host, service, or application is operable and functioning in line with expectations.

How will this work? The event router in our monitoring framework is responsible for tracking our events and metrics. It can potentially do a lot of useful things with those events and metrics including storing them, sending them to visualization tools, or using them and their values to notify us of performance issues. But most importantly it knows about the existence of those events and metrics. Let’s look at an example. We configure a web server to emit metrics showing the current workload. We then configure our event router to detect:

  • If the metric stops being reported.
  • If the value of a metric matches some criteria we’ve developed.

In the former case, if the metric disappears from our event router, we can be fairly certain that something has gone wrong. Either the web server has stopped working or something has happened between us and the server to prevent data from reaching our event router. In either case we’ve identified a fault that we may wish to investigate.

In the latter case, we get useful data from the payload of the event or metric. Not only is this data useful for long-term analysis of trends, performance, and capacity but it presents an opportunity to build a new paradigm for checking state. In our traditional monitoring model we rely on arbitrary thresholds to determine if we have an issue—for example, polling our CPU usage and reporting a warning if it is above a certain percentage. Now, instead of checking those arbitrary thresholds, we use a smarter approach. We can’t totally eliminate the need to set thresholds, but we can make our analysis a lot smarter by making the inputs to our thresholds more intelligent.
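A rough sketch of the first case, detecting that a metric has stopped being reported: here the event router is reduced to a dictionary of last-seen timestamps, which is a simplifying assumption rather than the actual router we’ll introduce later.

import time

last_seen = {}  # metric name -> timestamp of the most recent observation

def record(metric, value, timestamp=None):
    # Called whenever an event or metric arrives at the router.
    last_seen[metric] = timestamp or time.time()

def stale_metrics(max_age_seconds=60):
    # Any metric we haven't heard from recently is a potential fault:
    # either the emitter has stopped or something between us has failed.
    now = time.time()
    return [m for m, seen in last_seen.items() if now - seen > max_age_seconds]

record("webapp.requests_per_second", 250)
print(stale_metrics(max_age_seconds=60))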

2.6.6 Smarter threshold inputs

In our new model we still use thresholds but the data we feed into those thresholds is considerably more sophisticated. We will generate better data and analysis and get a better understanding of the experience of our users from our collected metrics. All of this leads to the identification of valid issues and problems. In our new monitoring framework we will:

  • Collect frequent and high-resolution data.
  • Look at windows of data not static points in time.
  • Calculate smarter input data.

Using this methodology we’re more likely to identify if a state is an actual issue instead of an anomalous spike or transitory state.
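A minimal sketch of checking a window of data rather than a single point, assuming we notify only when the median of the last N observations breaches the threshold; the window size and values are illustrative.

import statistics
from collections import deque

WINDOW = deque(maxlen=12)   # e.g. the last 12 observations

def observe(value, threshold=80):
    # A single spike won't trigger a notification; a sustained breach,
    # reflected in the median of the window, will.
    WINDOW.append(value)
    return len(WINDOW) == WINDOW.maxlen and statistics.median(WINDOW) > threshold

for cpu_percent in [40, 45, 95, 50, 48, 52, 47, 44, 49, 51, 46, 43]:
    if observe(cpu_percent):
        print("sustained high CPU usage")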

We’ll look at collection of high-frequency data and techniques for viewing windows of data in the forthcoming chapters. But calculating smarter input data for our thresholds and checks requires some explanation of the possible techniques we could choose, and of some we shouldn’t use. Let’s take a look at why, why not, and how we might use averages, the median, standard deviation, percentiles, and other statistical choices.

Note This is a high-level overview of some statistical techniques rather than a deep dive into the topic. As a result, exploration of some topics may appear overly simplistic to folks with strong statistical or mathematical backgrounds.

2.6.6.1 Average

Averages are the de facto metric analysis method. Indeed, pretty much everyone who has ever monitored or analyzed a website or application has used averages. In the web operations world, for example, many companies live and die by the average response time of their site or API. Averages are attractive because they are easy to calculate. Let’s say we have a list of seven time-series values: 12, 22, 15, 3, 7, 94, and 39. To calculate the average we sum the list of values and divide the total by the number of values in the list.

(12 + 22 + 15 + 3 + 7 + 94 + 39) / 7 = 27.428571428571

We first sum the seven values to get the total of 192. We then divide the sum by the number of values, here 7, to return the average: 27.428571428571. Seems pretty simple huh? The devil, as they say, is in the details.

Averages assume there is a normal event, or that your data is a normal distribution—for example, in our average response time it’s assumed all events run at equal speed or that the response time distribution is roughly bell curved. But this is rarely the case with applications. There’s an old statistics joke about a statistician who jumps in a lake with an average depth of only 10 inches and nearly drowns.

The flaw of averages - Copyright Jeff Danziger

So why did he nearly drown? The lake contained large areas of shallow water and some areas of deep water. Because there were larger areas of shallow water the average depth was lower overall. In the monitoring world the same principle applies: lots of low values in our average distort or hide high values and vice versa. These hidden outliers can mean that while we think most of our users are experiencing a quality service, there are potentially a significant number who are not.

Let’s look at an example, using response times and requests for a website.

Response time average

Here we have a plot showing response time for a number of requests. Calculating the average response time would give us 4.46 seconds. The vast majority of our users would experience a (potentially) healthy 4.46 second response time. But many of our users are experiencing response times of up to 12 seconds, perhaps considerably less acceptable.

Let’s look at another example with a wider distribution of values.

Response time average Mk II

Here our average would be a less stellar 6.8 seconds. But worse, this average is considerably better than the response time received by the majority of our users, who see a heavy distribution of request times around 9, 10, and 11 seconds. If we were relying on the average alone, we’d probably think our application was performing a lot better than our users are experiencing it.

2.6.6.2 Median

At this point you might be wondering about using the median. The median is the dead center of our values: exactly 50% of values are below it, and 50% are above it. If there is an odd number of values, the median will be the value in the middle. For the first data set we looked at—3, 7, 12, 15, 22, 39, and 94—the median is 15. If there were an even number of values, the median would be the mean of the two values in the middle. So, if we were to remove 39 from our data set to make it even, the median would become 13.5.

Let’s apply this to our two plots.

Response time average and median

We see in our first example figure that the median is 3, which provides an even rosier picture of our data.

In the second example the median is 8, a bit better but close enough to the average to render it ineffective.

Response time average and median Mk II

You can probably already see that the problem again here is that, like the mean, the median works best when the data is on a bell curve, and in the real world that’s not realistic.

Another commonly used technique to identify performance issues is to calculate the standard deviation of a metric from the mean.

2.6.6.3 Standard deviation

As we learned earlier in the chapter, standard deviation measures the variation or spread in a data set. A standard deviation of 0 means all of the data is equal to the mean; low deviations mean the data is clustered close to the mean, and higher deviations mean the data is more spread out. Standard deviations are represented by positive or negative numbers suffixed with the sigma symbol—for example, 1 sigma is one standard deviation from the mean.

Like the mean and the median, however, standard deviation works best when the data is a normal distribution. Indeed, in a normal distribution there’s a simple way of articulating the distribution: the empirical rule. Within the rule, one standard deviation on either side of the mean (from -1 to 1 sigma) will contain 68.27% of all transactions, two standard deviations (from -2 to 2 sigma) will contain 95.45%, and three standard deviations will contain 99.73% of all transactions.

The empirical rule

Many monitoring approaches take advantage of the empirical rule and trigger on transactions or events that are more than two standard deviations from the mean, potentially catching performance outliers. In instances like our two previous examples, however, the standard deviation isn’t overly helpful either. Without a normal distribution of data, the resulting standard deviation can be highly misleading.
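A sketch of the common two-standard-deviation trigger described above; as noted, it is only trustworthy when the underlying data is roughly normally distributed, and the sample values here are made up.

import statistics

def outside_two_sigma(observations, value):
    # Flag a value more than two standard deviations from the mean.
    mean = statistics.mean(observations)
    stdev = statistics.stdev(observations)
    return abs(value - mean) > 2 * stdev

response_times = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
print(outside_two_sigma(response_times, 2.5))   # True: a likely outlier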

Thus far our methods for identifying anomalous data in our metrics haven’t been overly promising. But all is not lost! Our next method, percentiles, will change that.

2.6.6.4 Percentiles

Percentiles measure the values below which a given percentage of observations in a group of observations fall. Essentially they look at the distribution of values across your data set. For example, the median we looked at above is the 50th percentile (or p50). In the median, 50% of values fall below and 50% above. For metrics, percentiles make a lot of sense because they make the distribution of values easy to grasp. For example, the 99th-percentile value of 10 milliseconds for a transaction is easy to interpret: 99% of transactions were completed in 10 milliseconds or less, and 1% of transactions took more than 10 milliseconds.

Percentiles are ideal for identifying outliers. If a great experience on your site is a response time of less than 10 milliseconds then 99% of your users are having a great experience—but 1% of them are not. You can then focus on addressing the performance issue that’s causing a problem for that 1%.

Let’s apply this to our previous request and response time graphs and see what appears. We’ll apply two percentiles, the 75th and 99th percentiles, to our first example data set.

Response time average, median, and percentiles

We see that the 75th percentile is 6 seconds. That indicates that 75% of requests completed in 6 seconds or less, and 25% of them were slower. Still pretty much in line with the earlier analysis we’ve examined for the data set. The 99th percentile, on the other hand, shows 11.73 seconds. This means 99% of users had request times of 11.73 seconds or less, and 1% had request times of more than 11.73 seconds. This gives us a real picture of how our application is performing. We can also use the distribution between p75 and p99. If we’re comfortable with 99% of users getting 11.73-second response times or better, and 1% being slower, then we don’t need to consider any further tuning. Alternatively, if we want a uniform response, or if we want to lower that 11.73 seconds across our distribution, we’ve now identified a pool of transactions we can trace, profile, and improve. As we adjust the performance we’ll also be able to see the p99 response time improve.

The second data set is even more clear.

Response time average, median, and percentiles Mk II

The 75th percentile is 10 seconds and the 99th percentile is 12 seconds. Here the 99th percentile provides a clear picture of the broader distribution of our transactions. This is a far more accurate reflection of the outlying transactions from our site. We now know that—as opposed to what the mean response times would imply—not all users are enjoying an adequate experience. We can use this data to identify elements of our application we can potentially improve.

Percentiles, however, aren’t perfect all the time. Our recommendation is to graph several combinations of metrics to get a clear picture of the data. For example, when measuring latency it’s often best to display a graph that shows:

  • The 50th percentile (or median).
  • The 95th and 99th percentiles.
  • The max value.

The addition of the max value helps visualize the upward bounds of the metric you are measuring. It’s again not perfect though—a high max value can dwarf other values in a graph.
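As a sketch of producing exactly that combination from a set of latency observations, we can use Python’s statistics.quantiles; the data below is generated purely for illustration (mostly fast requests with a slow tail).

import random
import statistics

random.seed(1)
# Illustrative latencies (ms): mostly fast, with a slow tail.
latencies = [random.gauss(10, 2) for _ in range(950)] + \
            [random.uniform(50, 200) for _ in range(50)]

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
print("p50:", round(statistics.median(latencies), 1), "ms")
print("p95:", round(cuts[94], 1), "ms")
print("p99:", round(cuts[98], 1), "ms")
print("max:", round(max(latencies), 1), "ms")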

We’re going to apply percentiles and other calculations later in the book as we start to build checks and collect metrics.

2.7 Collecting data for our monitoring framework

In our framework we’re going to focus on agent-based collection of data. We’re going to prefer running local agents on hosts, and focus on the instrumentation of applications and services. Wherever possible each host or service will be self-contained and responsible for emitting its own monitoring data. We’ll locally configure collection and the destination of our data.

In keeping with our push-based architecture we’ll try to avoid remote checks of hosts and services. With a few exceptions—like external monitoring of hosts and applications—that we’ll discuss in Chapter 9, we’ll rarely poll hosts, services, and applications from remote pollers or monitoring stations.

Our data collection will include a mix of data:

  • Resource information, like consumption of CPU or memory
  • Performance information, like latency and application throughput
  • Business and user-experience metrics, like the volume or value of transactions or the number of failed logins
  • Log data from hosts, services, and applications

We’ll use much of the data and observations we collect directly as metrics. In some cases we’ll also convert observations in the form of events into metrics.
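As a small example of converting events into metrics, the following sketch turns log lines into a counter of errors per collection interval; the log format and metric name are assumptions.

log_lines = [
    "2015-08-07T03:36:40Z INFO  payment accepted",
    "2015-08-07T03:36:41Z ERROR payment gateway timeout",
    "2015-08-07T03:36:42Z ERROR payment gateway timeout",
]

def errors_per_interval(lines):
    # Each ERROR event increments a counter metric for this interval.
    return sum(1 for line in lines if " ERROR " in line)

print("payment.errors:", errors_per_interval(log_lines))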

2.7.1 Overhead and the observer effect

One thing to consider when thinking about data collection is that the process of collecting the data can also impact the values being collected. In normal operation many of the methods we use to collect data will consume some of the resources we’re monitoring. Sometimes this overhead becomes excessive and can actually influence the state of what you’re monitoring, or worse, trigger notifications and outages. This is often called the observer effect, derived from the related physics concept. The methods we’re going to use will focus on making that overhead as small as possible, but you should remain conscious of the effect. Overly frequent or aggressive collection—for example, hammering an HTTP site or an API endpoint with checks—could result in your monitoring itself consuming a measurable percentage of the service’s capacity.

2.8 Summary

In this chapter we’ve articulated the framework we’re going to build to monitor our environment. We’ve talked about the push versus pull architecture and the focus on events and metrics. We’ve also discussed why we’ve chosen that architecture and what’s wrong with some of the monitoring alternatives out there. We then walked through an introduction to some of the monitoring and metrics principles we’re going to use throughout the book.

In the next chapter we launch our monitoring framework with the introduction of our event routing engine.
