System-Wide Transparency

Back in Transparency, we saw how individual instances can reveal their state. That’s only the beginning of the transparency story, though. Now we look at how to assemble a picture of system-wide health from the individual instances’ information.

The first place to start is by defining what we need from our efforts. When dealing with the system as a whole, two fundamental questions need to be answered:

  1. Are users receiving a good experience?

  2. Is the system creating the economic value we want?

Notice that the question, “Is everything running?” isn’t on that list. Even at small scale, we should be able to survive periods where everything isn’t running. At scale, “partially broken” is the normal state of operation. It’s rare to find all instances running with no deployments or failures at any given moment.

Real-User Monitoring

It is hard to deduce whether users are receiving a good experience from individual instance metrics. (It would require a model of the whole system that accounts for circuit breakers, caches, fallbacks, and a pile of other implementation details that change frequently.) Instead, the best way to tell if users are receiving a good experience is to measure it directly. This is known as real-user monitoring (or RUM, if you like).

Mobile and web apps can have instrumentation that reports their timing and failures up to a central service. That can take a lot of infrastructure, so you may consider a service such as New Relic or Datadog.[36][37] If you are at a scale where it makes sense to run it yourself, on-premise software such as AppDynamics or CA’s APM might be the thing for you.[38][39] Some of these products also allow you to watch network traffic at the edge of your system, recording HTTP sessions for analysis or playback.

Using these services has three advantages over the “DIY” approach. The first is rapid startup. You don’t need to build infrastructure or configure monitoring software. It is quite possible to get going with data collection in under an hour. Second, they offer agents and connectors for a wide array of technology, which makes it much easier to integrate all your monitoring into one place. Finally, their dashboards and visualization tend to be more polished than open-source alternatives.

There are downsides, of course. For one thing, these are commercial services. You’ll be paying a subscription fee. As your system scales, so will your fees. There may come a time when the fees become unpalatable, but the switching cost of moving to your own infrastructure is equally unpalatable. Second, some companies are absolutely unwilling to have even monitoring data crossing the Internet.

On-premise commercial solutions, such as AppDynamics, offer easy integration and polished visualization, but these lose the advantage of rapid startup and also have scaling fees.

The open-source arena has produced some excellent tools, but the usual open-source effect is at play: integrating the tools with your system can be a challenge. For that matter, integrating the tools with each other can be a challenge! The dashboards and visualization are also less polished and less user-friendly. While it removes the very visible monthly fee for a service, the open-source approach has less-visible costs in the form of labor and infrastructure.

Half of the vendors at operations or software architecture conferences are in this space, so the names may change by the time you read this. The broad category here is called “application performance management,” and it seems to be one of the last areas of operations software that hasn’t been replaced by open-source packages. As with other kinds of operations software, it’s not that important to choose the ideal solution. Instead, focus on adopting your chosen solution thoroughly. Don’t leave any “dead zones” in your system.

Real-user monitoring is most useful for understanding the current state and recent history. Dashboards and graphs are the most common ways to visualize this.
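
Whichever product or stack you choose, the raw input is the same: each user-facing operation reports its timing and outcome to a central collector. Here is a minimal sketch in Python; the collector URL and payload shape are placeholders of my own, not any vendor’s API, and a real RUM agent would do this inside the browser or mobile app for you.

  # Sketch: report one operation's timing and outcome to a central collector.
  # The endpoint URL and payload fields are hypothetical placeholders.
  import time
  import requests

  COLLECTOR_URL = "https://rum-collector.example.internal/beacons"

  def report(operation, func, *args, **kwargs):
      start = time.perf_counter()
      error = None
      try:
          return func(*args, **kwargs)
      except Exception as ex:          # record the failure, then re-raise
          error = type(ex).__name__
          raise
      finally:
          beacon = {
              "operation": operation,
              "duration_ms": (time.perf_counter() - start) * 1000,
              "error": error,
              "timestamp": time.time(),
          }
          # Best effort only; never let monitoring break the user's request.
          try:
              requests.post(COLLECTOR_URL, json=beacon, timeout=0.5)
          except requests.RequestException:
              pass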

Economic Value

Some software exists as art and some exists as entertainment. Most of the software we write for companies exists to create economic value. It may seem odd to be talking about the economics of software systems in a section about transparency, but this is where we can most directly perceive the linkage between our systems and our financial success. The value created by our systems can be harmed if the user experience is bad. It can also be harmed if the system cost is too high. These are the “top line” and “bottom line” effects. We should build our transparency in terms of revealing the way that the recent past, current state, and future state connect to revenue and costs.

The top line is income. Revenue. The good stuff. Our system should be able to tell us if we’re making as much as we “should be” right now. For example, are there performance bottlenecks that prevent us from signing up more new users? Is some crucial service returning errors that turn people off before they register? The specific needs here vary according to your domain, but you should plan to watch the following:

  • Watch each step of a business process. Is there a rapid drop-off in some step? Is some service in a revenue-generating process throwing exceptions in logs? If so, it’s probably reducing your top line.

  • Watch the depth of queues. Queue depth is your first indicator of performance degradation. A non-zero queue depth always means work takes longer to get through the process. For many business transactions, that queuing time directly hits your revenue.
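
The queue-depth check can be a very small piece of code. Here is a sketch; get_queue_depth and alert are placeholders for whatever your queue broker and alerting system actually provide, and the caution threshold is an illustrative number to tune per queue.

  # Sketch: flag revenue-relevant queues whose depth stays elevated.
  # get_queue_depth() and alert() are placeholders for your own infrastructure.

  CAUTION_DEPTH = 10     # tune per queue; any sustained depth > 0 adds latency

  def check_queues(queue_names, get_queue_depth, alert):
      for name in queue_names:
          depth = get_queue_depth(name)
          if depth > CAUTION_DEPTH:
              alert(f"Queue {name} depth is {depth}; work is waiting and "
                    f"revenue-generating transactions are slowing down.")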

The bottom line is net profit (or loss). It is the top line minus costs. Cost comes from infrastructure, especially in these days of autoscaled, elastic, pay-as-you-go services. Nearly every startup has a horror story about autoscaling costing thousands of dollars during an unexpected surge in demand. Worse yet, that spending sometimes comes from runaway automation spinning up too many resources rather than from real traffic.
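
One crude guard against that kind of surprise is to treat projected spend as a metric in its own right. A sketch follows; the instance types, hourly prices, and budget are made-up numbers for illustration, not real rates.

  # Sketch: estimate hourly infrastructure spend and compare it to a budget.
  # Instance kinds, hourly prices, and the budget are illustrative values.

  HOURLY_PRICE = {"web": 0.096, "worker": 0.192, "db": 0.40}   # dollars/hour
  HOURLY_BUDGET = 50.00

  def check_spend(instance_counts, alert):
      """instance_counts: e.g. {"web": 120, "worker": 300, "db": 4}"""
      spend = sum(HOURLY_PRICE[kind] * count
                  for kind, count in instance_counts.items())
      if spend > HOURLY_BUDGET:
          alert(f"Projected spend ${spend:.2f}/hour exceeds budget "
                f"${HOURLY_BUDGET:.2f}/hour; check for runaway autoscaling.")
      return spend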

Cost also comes from operations. The harder your software is to operate, the more time it takes from people. That’s true whether you’re in a DevOps-style organization or a traditional siloed organization. Either way, any time spent responding to incidents is unplanned work that could have gone to raising the top line.

Another less visible source of cost comes from our platforms and runtimes. Some languages are very fast to code in but require more instances to handle a particular workload. You can improve the bottom line by moving crucial services to technology with a smaller footprint or faster processing. Before you do, though, make sure it’s a service that makes a difference. In other words, your feature that detects birds in photographs taken inside national parks may require a lot of CPU time; but if it only gets used once a month, it’s not material to your bottom line.

So far we’ve talked about the current state and recent past. Our transparency tools should also help us consider the near future, with questions such as these (a capacity-projection sketch follows the list):

  • Are there opportunities to increase the top line by improving performance or reducing queues?

  • Are we going to hit a bottleneck that will prevent us from increasing the top line?

  • Are there opportunities to increase the bottom line by optimizing services? Can we see places that are overscaled?

  • Can we replace slow-performing or large-footprint instances with more efficient ones?
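
The bottleneck question lends itself to a simple projection: fit a trend line to recent utilization and estimate when it crosses capacity. A rough sketch, assuming evenly spaced samples; real forecasting deserves more care than a straight line.

  # Sketch: project when a utilization metric will hit its ceiling,
  # using a least-squares line over recent, evenly spaced samples.

  def hours_until_saturation(samples, capacity, hours_per_sample=1.0):
      """samples: recent utilization values, oldest first."""
      n = len(samples)
      xs = range(n)
      mean_x = sum(xs) / n
      mean_y = sum(samples) / n
      cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
      var = sum((x - mean_x) ** 2 for x in xs)
      slope = cov / var if var else 0.0
      if slope <= 0:
          return None            # flat or improving; no projected saturation
      return (capacity - samples[-1]) / slope * hours_per_sample

  # Example: CPU at 62..70% over the last five hours, ceiling at 85%.
  print(hours_until_saturation([62, 64, 66, 68, 70], capacity=85))   # -> 7.5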

The idea that monitoring, log collection, alerting, and dashboarding are about economic value more than technical availability may be unfamiliar. Even so, if you adopt this perspective, you’ll find it easy to make decisions about what to monitor, how much data to collect, and how to represent it.

The Risk of Fragmentation

The usual notion of perspectives splits into “technical” and “business” concerns. The “technical” perspective may even be split into “development” and “operations.” Most of the time, these constituencies look at different measurements collected by different means. Imagine the difficulty in planning when marketing uses tracking bugs on web pages, sales uses conversions reported in a business intelligence tool, operations analyzes log files in Splunk, and development uses blind hope and intuition. Could this crew ever agree on how the system is doing? It’d be much better to integrate the information so all parties can see the same data through similar interfaces.

Different constituencies require different perspectives. These perspectives won’t all be served by the same views into the systems, but they should be served by the same information system overall. Just as the question, “How’s the weather?” means very different things to a gardener, a pilot, and a meteorologist, the question, “How’s it going?” means something decidedly distinct when coming from the CEO or the system administrator. Likewise, a bunch of CPU utilization graphs won’t mean a lot to the marketing team. Each “special interest group” in your company may have its own favorite dashboard, but everyone should be able to see how releases affect user engagement or how latency affects conversion rate.

Logs and Stats

In Transparency, we saw the importance of good logging and metrics generation at the microscopic scale. At the system scale, we need to gather all that data and make sense of it. This is the job of log and metrics collectors.

Like a lot of these tools, log collectors can either work in push or pull mode. Push mode means the instance is pushing logs over the network, typically with the venerable syslog protocol.[40] Push mode is quite helpful with containers, since they don’t have any long-lived identity and often have no local storage.
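
For example, a Python service can push its log records to a central syslog collector with nothing but the standard library. The collector host name below is a placeholder for your own aggregation endpoint.

  # Sketch: push-mode logging to a central syslog collector over UDP.
  # The collector host is a placeholder for your own log aggregation endpoint.
  import logging
  import logging.handlers

  handler = logging.handlers.SysLogHandler(address=("logs.example.internal", 514))
  handler.setFormatter(logging.Formatter("myservice: %(levelname)s %(message)s"))

  log = logging.getLogger("myservice")
  log.addHandler(handler)
  log.setLevel(logging.INFO)

  log.info("Instance started; shipping logs to the central collector.")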

With a pull-mode tool, the collector runs on a central machine and reaches out to all known hosts to remote-copy the logs. In this mode, services just write their logs to local files.

Just getting all the logs on one host is a minor achievement. The real beauty comes from indexing the logs. Then you can search them for patterns, make trendline graphs, and raise alerts when bad things happen. Splunk dominates the log indexing space today.[41] The troika of Elasticsearch, Logstash, and Kibana is another popular implementation.
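
Once the logs are indexed, pattern searches and trendlines are just queries. For instance, a date-histogram aggregation can turn raw error log lines into an error-rate trend. The sketch below assumes Elasticsearch 7 or later; the index pattern, field names, and host are assumptions about your log schema, not fixed conventions.

  # Sketch: count error-level log events per minute from an Elasticsearch index.
  # Index pattern, field names, and the host reflect an assumed log schema.
  import requests

  query = {
      "query": {"match": {"level": "ERROR"}},
      "aggs": {
          "errors_per_minute": {
              "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}
          }
      },
      "size": 0,
  }

  resp = requests.post(
      "http://elasticsearch.example.internal:9200/logs-*/_search",
      json=query,
      timeout=5,
  )
  for bucket in resp.json()["aggregations"]["errors_per_minute"]["buckets"]:
      print(bucket["key_as_string"], bucket["doc_count"])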

The story for metrics is much the same, except that the information isn’t always available in files. Some information can only be retrieved by running a program on the target machine to sample, say, network interface utilization and error rates. That’s why metrics collectors often come with additional tools to take measurements on the instances.
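
On Linux, for example, a tiny collector-side script can sample per-interface error counters from /proc/net/dev without any agent framework at all. A sketch:

  # Sketch: sample per-interface receive/transmit error counters on Linux
  # by parsing /proc/net/dev. Counters are cumulative since boot, so a
  # collector would diff successive samples to get a rate.

  def nic_error_counts():
      counts = {}
      with open("/proc/net/dev") as f:
          for line in list(f)[2:]:               # skip the two header lines
              name, stats = line.split(":", 1)
              fields = stats.split()
              counts[name.strip()] = {
                  "rx_errs": int(fields[2]),
                  "tx_errs": int(fields[10]),
              }
      return counts

  print(nic_error_counts())   # e.g. {'eth0': {'rx_errs': 0, 'tx_errs': 0}, ...}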

Metrics also have the interesting property that you can aggregate them over time. Most of the metrics databases keep fine-grained measurements for very recent samples, but then they aggregate them to larger and larger spans as the samples get older. For example, the error rate on a NIC may be available second by second for today, in one-minute granularity for the past seven days, and only as hourly aggregates before that. This has two benefits. First, it really saves on disk space! Second, it also makes queries across very large time spans possible.
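
Here is a sketch of that rollup step: collapse per-second samples into per-minute aggregates, keeping enough information (min, max, mean, count) to answer most later questions. The bucket span and sample format are choices for illustration.

  # Sketch: roll up fine-grained samples into coarser aggregates.
  # samples: list of (unix_timestamp, value) pairs at roughly per-second grain.
  from collections import defaultdict

  def rollup(samples, span_seconds=60):
      buckets = defaultdict(list)
      for ts, value in samples:
          buckets[int(ts // span_seconds) * span_seconds].append(value)
      return {
          bucket_start: {
              "min": min(vals),
              "max": max(vals),
              "mean": sum(vals) / len(vals),
              "count": len(vals),
          }
          for bucket_start, vals in sorted(buckets.items())
      }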

What to Expose

If you could predict which metrics would limit capacity, reveal stability problems, or expose other cracks in the system, then you could monitor only those. But that prediction will have two problems. First, you’re likely to guess wrong. Second, even if you guess right, the key metrics change over time. Code changes and demand patterns change. The bottleneck that burns you next year probably doesn’t exist right now.

Of course, you could spend an unlimited amount of effort exposing metrics for absolutely everything. Since your system still has to do something other than just collect data, I’ve found a few heuristics to help decide which variables or metrics to expose. Some of these will be available right away. For others, you might need to add code to collect the data in the first place. Here are some categories of things I’ve consistently found useful.

Traffic indicators

Page requests, page requests total, transaction counts, concurrent sessions

Business transaction, for each type

Number processed, number aborted, dollar value, transaction aging, conversion rate, completion rate

Users

Demographics or classification, technographics, percentage of users who are registered, number of users, usage patterns, errors encountered, successful logins, unsuccessful logins

Resource pool health

Enabled state, total resources (as applied to connection pools, worker thread pools, and any other resource pools), resources checked out, high-water mark, number of resources created, number of resources destroyed, number of times checked out, number of threads blocked waiting for a resource, number of times a thread has blocked waiting

Database connection health

Number of SQLExceptions thrown, number of queries, average response time to queries

Data consumption

Number of entities or rows present, footprint in memory and on disk

Integration point health

State of circuit breaker, number of timeouts, number of requests, average response time, number of good responses, number of network errors, number of protocol errors, number of application errors, actual IP address of the remote endpoint, current number of concurrent requests, concurrent request high-water mark

Cache health

Items in cache, memory used by cache, cache hit rate, items flushed by garbage collector, configured upper limit, time spent creating items

All of the counters have an implied time component. You should read them as if they all end with “in the last n minutes” or “since the last reset.”
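
As one concrete example from the list above, a resource pool can expose its health as a simple snapshot of gauges and counters, with the counters carrying that “since the last reset” meaning. This is a sketch only; the attribute names are illustrative, not a standard.

  # Sketch: a resource pool exposing its health as a snapshot of gauges
  # and counters. Counter fields are cumulative "since the last reset".
  import time
  from dataclasses import dataclass, asdict, field

  @dataclass
  class PoolHealth:
      enabled: bool = True
      total_resources: int = 0          # gauge
      checked_out: int = 0              # gauge
      high_water_mark: int = 0          # gauge
      created_count: int = 0            # counter
      destroyed_count: int = 0          # counter
      checkout_count: int = 0           # counter
      threads_blocked: int = 0          # gauge
      block_count: int = 0              # counter
      last_reset: float = field(default_factory=time.time)

      def snapshot(self):
          """Return a plain dict suitable for a health endpoint or metrics push."""
          return asdict(self)

      def reset_counters(self):
          self.created_count = self.destroyed_count = 0
          self.checkout_count = self.block_count = 0
          self.last_reset = time.time()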

As you can see, even a medium-sized system could have hundreds of metrics. Each one has some range in its normal and acceptable values. This might be a tolerance around a target value or a threshold that should not be crossed. The metric is “nominal” as long as it’s within that acceptable range. Often, a second range will indicate a “caution” signal, warning that the parameter is approaching a threshold.

For continuous metrics, a handy rule-of-thumb definition for nominal would be “the mean value for this time period plus or minus two standard deviations.” The choice of time period is where it gets interesting. Most metrics have a traffic-driven component, so the time period that shows the most stable correlation will be the “hour of the week”—that is, 2 p.m. on Tuesday. The day of the month means little. In certain industries—such as travel, floral, and sports—the most relevant measurement is counting backward from a holiday or event.

For a retailer, the “day of week” pattern will be overlaid on a strong “week of year” cycle. There is no one right answer for all organizations.
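
Here is a sketch of that “hour of the week” baseline: group historical samples by (weekday, hour), then call a new sample nominal if it falls within two standard deviations of that slot’s mean. The caution band between two and three standard deviations is an illustrative choice, not a standard.

  # Sketch: classify a metric sample against an "hour of the week" baseline.
  # history: list of (datetime, value) pairs covering several weeks.
  from collections import defaultdict
  from statistics import mean, stdev

  def build_baseline(history):
      slots = defaultdict(list)
      for ts, value in history:
          slots[(ts.weekday(), ts.hour)].append(value)
      return {slot: (mean(vals), stdev(vals))
              for slot, vals in slots.items() if len(vals) >= 2}

  def classify(baseline, ts, value):
      slot = (ts.weekday(), ts.hour)
      if slot not in baseline:
          return "unknown"
      mu, sigma = baseline[slot]
      deviation = abs(value - mu)
      if deviation <= 2 * sigma:
          return "nominal"
      if deviation <= 3 * sigma:            # illustrative caution band
          return "caution"
      return "alert"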
