© Thomas W. Dinsmore 2016

Thomas W. Dinsmore, Disruptive Analytics, 10.1007/978-1-4842-1311-7_6

6. Streaming Analytics

Insight from Data in Motion

Thomas W. Dinsmore

(1)Newton, Massachusetts, USA

Streaming analytics is the application of analytic operations to streaming data for applications such as:

  • Algorithmic trading

  • Customer interaction management

  • Intelligence and surveillance

  • Patient monitoring

  • Supply chain optimization

  • Network monitoring

  • Oil and gas optimization

  • Vehicle tracking and route monitoring

Market research firm Markets and Markets estimates1 total spending on streaming analytics at $502 million in 2015, and predicts a 31% growth rate through 2020. While this is a sizeable market in its own right, keep in mind that it is only about 1% of worldwide spending on business analytics software.2

In this book, we use the term “real-time analytics” to describe a specific type of operation, described below. We use the term “streaming analytics” to describe the technology used in real-time analytics, and “low-latency analytics” to describe a desired outcome, minimized latency. The distinctions are important; while most examples of real-time analytics are low-latency, the reverse is not necessarily true.

Gartner defines “real-time” computing as:

The description for an operating system that responds to an external event within a short and predictable time frame. Unlike a batch or time-sharing operating system, a real-time operating system provides services or control to independent ongoing physical processes. It typically has interrupt capabilities (so that a less important task can be put aside) and a priority-scheduling management scheme. 3

In other words, real-time analytics:

  1. Must be completed in a defined window of time.

  2. Are initiated by an external process, such as incoming data.

  3. Have priority over other operations.

A real-time operation is not “zero-latency”. Since any operation takes some time to complete, all operations entail some latency, even if only a millisecond. However, since real-time operations are initiated by an external process (the second criterion cited here), and have priority over other operations, they experience no waiting time attributable to queueing.

Real-time analytics aren’t distinguished by any absolute level of latency or time window for execution; the use case defines the time window. In an electronic market, an ultra-low latency trade is completed in less than one millisecond.4 In the field of plate tectonics, an annual measure could qualify as “real-time”.

Ignoring this distinction produces odd debates over the precise definition for “real-time”:

  • Vivek Ranadivé, founder of TIBCO, a company whose DNA is real-time analytics, writes about “the two-second advantage”.5

  • Forrester analyst Mike Gualtieri asserts that “real-time” is anything less than one second latency.6

  • Capital One’s Slim Baltagi argues7 that Apache Spark isn’t suitable for real-time analytics because it can “only” reduce latency to half a second.

Does real-time mean “two seconds or less,” “less than one second,” or “less than half a second”? The answer is possibly all of the above, and possibly none of the above. It all depends on the use case.

The potential of real-time analytics inspires hyperbole. Eric Woods of Navigant Research writes:

How well we deliver on the goal of real-time analytics will tell us much about the real level of maturity of our systems and our managerial structures. 8

He wrote that in 2002.

Advocates for real-time analytics overstate its importance in the business analytics universe. Admiral Husband E. Kimmel, Commander in Chief of the United States Pacific Fleet at Pearl Harbor on December 7, 1941, had no shortage of real-time information about the Japanese attack. What he needed was a prediction.

We begin the chapter with a short history of streaming analytics, followed by a review of streaming fundamentals. In the third section of the chapter, we review popular streaming data sources, such as Apache Kafka and Amazon Kinesis, followed by a survey of the top open source streaming engines. We close the chapter with some examples of streaming analytics in action, and some observations about the economics of streaming.

A Short History of Streaming Analytics

As discussed in Chapter Two, the digital transformation of business creates new opportunities for analytics. In the 1980s, financial markets were just beginning to transition from the open outcry system, an auction system based on human traders in one location, to electronic trading. NASDAQ was the first U.S. market to do so, followed by the Chicago Mercantile Exchange.

Vivek Ranadivé, a 28-year-old graduate of MIT and the Harvard Business School, saw an opportunity to speed the integration of systems, and founded Teknekron Software Systems in 1986. Teknekron developed a software bus to link software programs and transfer information between them at high speed, marketing the product under the Information Bus trademark. Teknekron described9 the product as:

Computer software to aid the process whereby one computer obtains data from another computer by receiving requests for data by subject matter, mapping the subject to a particular computer on a network which can supply that data and performing any necessary format conversion operations between incompatible formats so that the data received on the subject can be used by the requesting computer.

Goldman Sachs engaged10 Teknekron to develop a stock trading system. Fidelity Investments, First Interstate Bank, and Salomon Brothers invested in systems based on Teknekron’s Information Bus to integrate and deliver stock quotes, news, and other financial information to traders. Since most of the financial markets in the 1980s continued to operate through open outcry, investors and brokers focused on consolidating information for human traders rather than algorithmic trading.

Meanwhile, in the credit card business, fraud was a growing problem. Worldwide deployment of the Visa and MasterCard interchange systems created a growing need for real-time fraud detection. In 1992, HNC Software introduced its Falcon application, as discussed in Chapter Two. Falcon analyzed individual credit card transactions as they were presented for authorization, flagging suspects within the time window dictated by Visa/MasterCard rules.

HNC built Falcon on mainstream client-server technology (with an indexed file system for fast profile lookups). Falcon could analyze and disposition authorization requests at the rate of 120 transactions per second, which was more than adequate at the time.11 Within ten years, Falcon handled 80% of the global Visa/MasterCard transaction authorizations.

Growing use of the World Wide Web created opportunities for real-time customer interaction management. While working at the University of Paris in the early 1990s, Dr. Khai Minh Pham developed a predictive modeling technique he called Agent Network Technology12 (ANT). In 1994, he founded DataMind, offering a product branded as DataCruncher; the company launched in 1996 with $4.7 million in venture capital and a patent on ANT.

In 1998, DataMind rebranded itself as Rightpoint and repositioned its product as a platform for real-time customer interaction management. Leveraging the “self-learning” ANT technology, Rightpoint offered customers capabilities for clickstream tracking, collaborative filtering, and real-time profiling to drive offer recommendations and personalized content. CRM vendor E.piphany Inc. acquired13 Rightpoint for $393 million ($562 million in 2016 dollars) in 1999.

While Khai Minh Pham developed ANT and launched DataMind, Stanford’s David Luckham proposed a research project in discrete event simulation to the Advanced Research Projects Agency.14 A key element of this project (named Rapide) was the ability to model patterns of concurrent events—in other words, the ability to infer higher-level events from many discrete events closely spaced in time.

Beginning in 1993, Luckham and his team developed the Rapide language and techniques to infer patterns or high-level events from streams of messages about more atomic events. This work culminated in 1998 with the publication of Complex Event Processing in Distributed Systems. 15

Simultaneously, at Cambridge University, John Bates16 developed a theory of complex event inference that was similar to Luckham’s. Lacking tools to describe and handle patterns of complex events, Bates and his team developed the necessary software. In 1999 Bates and Giles Nelson, a colleague, founded a company branded as APAMA to commercialize the software, targeting traders and investors in financial markets. (They sold17 the company to Progress Software in 2005 for $25 million.)

In the same year, a team of former Cray executives founded Aleri Group. Aleri offered a high-performance platform built on a vector database18, a technology that offered the throughput needed to manage and analyze high-speed high-volume trading data.

Vivek Ranadivé had sold19 Teknekron to Reuters in 1993 for $125 million ($206 million in 2016 dollars); with the proceeds, he founded TIBCO Software in 1997. Unlike APAMA and Aleri, TIBCO focused on middleware, the infrastructure that ties systems together and makes low-latency communications possible. Timed perfectly for the late 1990s internet boom, the company grew rapidly; revenue increased from $53 million in 1998 to $96 million in 1999. TIBCO’s initial public offering at $10 in 1999 valued the company at more than $300 million.

Wall Street fell in love with TIBCO’s stock, driving it up to 12 times the IPO price by the summer of 2000. Then, in the wake of the internet bust, the stock fell to single digits by 2001.

In that year, Michael Stonebraker of MIT, together with scientists from Brandeis and Brown, started work on the Aurora project for data stream management. The team designed Aurora to handle large numbers of asynchronous push-based data streams (in contrast to relational databases, from which users “pull” data with discrete requests). Aurora represented streaming processes as a directed graph that mapped the flow of data from sources, through streaming operators, and then to consuming applications.20

Aurora users built continuously operating queries with standard filtering, mapping, windowing, and join operations. Windowed operations supported timeout and slack parameters enabling the engine to handle slow and out-of-order operations.

Stonebraker and others founded StreamBase Systems in 2003 to commercialize an enterprise-grade version of the Aurora engine. Like Aurora, StreamBase uniquely integrated streaming and SQL operations into a single platform. Backed with $5 million in venture capital, StreamBase released21 its first products in August 2004 and closed22 an $11 million “B” round in January 2005, intending to sell to investment banks, hedge funds, and government agencies.

Financial markets transformed rapidly in this period. The U.S. Securities and Exchange Commission authorized electronic trading in regulated securities in 1998. Electronic trading created opportunities for algorithmic trading to detect and execute trades that exploit short-lived opportunities, including arbitrage between markets, arbitrage between indexes and the underlying stocks, and other tactics both licit and illicit.

Reducing latency in market trades makes markets more efficient. However, trading itself is a zero-sum game, with benefits accruing to the trader that can accumulate and act on information faster than all other traders. Between 1999 and 2010, market participants invested heavily in the tools and infrastructure needed to drive latency out of their trading operations, in a kind of an arms race. The time needed to complete trades declined to milliseconds and even microseconds. In 2010, an executive of the Bank of England predicted that trading time would decrease to nanoseconds.23

The key players in streaming analytics profited from growing interest in low-latency analytics. Through the 2000s, TIBCO steadily delivered low-latency infrastructure to an expanding list of industries: mobile telecommunications; airlines, for baggage handling, ticketing, and check-in; insurance, for claims handling; and gaming. Amazon.com adopted TIBCO middleware to support its recommendation engine, and FedEx deployed it for package tracking. Through 2009, TIBCO’s revenue grew by double digits.

In the same period, under management by Progress Software, APAMA increased its presence on Wall Street for algorithmic trading, as well as in retail banking, telecommunications, logistics, government, energy, and manufacturing. APAMA’s strong point was a visual interface that made it easy for business analysts to set up and run streaming applications.

The streaming analytics industry began to consolidate in the late 2000s. In 2009, Aleri merged with Coral8, a competing CEP vendor. The combined entity offered a suite of software for liquidity management, low latency trading across markets, low latency risk management, and stress testing. Database vendor Sybase acquired the assets of Aleri in March 2010 and enterprise software vendor SAP acquired24 Sybase two months later.

IBM entered the streaming analytics market in 2009. IBM Research spent years developing25 the streaming platform it called “System S”. Touting the system under the neologism “perpetual analytics”—a concept that did not stick—IBM released26 the product branded as IBM Infosphere Streams.

For high performance and throughput, IBM designed Streams to distribute workload over clustered servers. IBM invented27 two new languages for streaming analytics, Stream Processing Application Declarative Engine (SPADE28) and Mashup Automation with Runtime Invocation & Orchestration (MARIO); neither has gained acceptance outside of IBM. Infosphere Streams scored well on performance and scalability tests, and it supported comprehensive operators and development tools in an Eclipse-based integrated development environment.

Despite its technical strengths, StreamBase struggled to compete in the narrow niche of real-time analytics. After stating that its 2005 venture funding would be its last private round, StreamBase closed another round in 2007, then accepted a “down” round in 2009 shortly after the departure of CEO Barry S. Morris. In 2013, TIBCO acquired StreamBase Systems for $49.7 million.29

Two days later, Progress Software sold APAMA to Software AG for $44 million.30 Software AG supplemented the core APAMA software with a suite of tools for order routing, pre-trade risk, and other building blocks for capital markets solutions; a separate scoring engine that works with predictive models trained offline and imported through PMML; and a dashboarding application.

As of 2016, Software AG, IBM, SAP, TIBCO, and Oracle dominate the commercial market for streaming analytics. (Oracle entered the market through its acquisition of BEA Systems in 2008.) Forrester rates31 all five as “leaders” in its annual market survey, together with startups SQLstream and DataTorrent.

Fundamentals of Streaming Analytics

In this section, we cover the basics of streaming analytics: the definition of stream processing, streaming operations, Complex Event Processing, streaming machine learning, and anomaly detection.

Stream Processing

The term “streaming analytics” combines two concepts: streaming data processing and low-latency analytics. Streaming data processing is a systems paradigm or architecture that processes data as it arrives. Low-latency analytics is a desired outcome, the delivery of insight about events as soon as possible after those events occur in the real world. Organizations use streaming data processing as a means to accomplish the end of low-latency analytics.

We contrast streaming data processing with batch processing. Under batch processing, programs work with finite sets of data gathered in discrete sets, or batches. The program runs until it finishes handling the data included in the batch, then terminates.

A batch process can run on a fixed schedule, such as every night at midnight. It can also run whenever the accumulated data reaches a certain threshold, or it can run on an ad hoc schedule under the manual control of an operator.

Batch processing is efficient, but it builds latency into the process. In a batch process, the total latency equals the amount of time an arriving record waits in a queue, plus the time needed to perform operations on the data. This latency can be highly visible when the information or insight is critical for the operations of the organization.

Streaming data processing works in a different manner. In streaming data processing, programs handle data as it arrives, one unit at a time. Once the data stream starts, processing continues as fast as the data arrives and continues without a predetermined end.

Since a streaming process runs continuously, without startup or cleanup, it must be fault-tolerant, with the ability to transfer processing from a failed node to a working node, and the ability to reconstruct data when a process fails.

In addition to reduced latency, a well-designed streaming process spreads workload over the day. In contrast, a batch process creates workload “spikes” that may be more difficult to manage. Organizations can schedule batch operations to run in off-peak periods; but as organizations use cloud computing and virtualization to shift workloads, there are fewer off-peak periods when infrastructure is idle.

Streaming systems must enforce one of three processing semantics:

  • At-least-once processing ensures that no messages sent by the source system will be omitted by the receiving system.

  • At-most-once processing ensures that no messages will be duplicated in the receiving system.

  • Exactly-once processing ensures that each message sent by the source system is captured in the receiving system once and only once.
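
The choice among these semantics often comes down to when the consuming application acknowledges, or “commits,” its position in the stream relative to when it processes each message. A minimal sketch using the kafka-python client illustrates the trade-off; the broker address, topic, and handler function are assumptions for illustration, and the same pattern applies to most message brokers.

```python
from kafka import KafkaConsumer

def process(value: bytes) -> None:
    """Hypothetical handler for a single message."""
    print(value)

consumer = KafkaConsumer(
    "transactions",                      # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="stream-processor",
    enable_auto_commit=False,            # we control when offsets are committed
)

for message in consumer:
    # At-most-once: commit BEFORE processing. If processing then fails,
    # the message is lost, but it is never processed twice.
    #   consumer.commit(); process(message.value)

    # At-least-once: process BEFORE committing. If the consumer crashes
    # after processing but before the commit, the message is redelivered
    # and may be processed twice.
    process(message.value)
    consumer.commit()
```

Exactly-once behavior generally requires more than commit ordering: the processing step itself must be idempotent or transactional, so that a redelivered message has no additional effect.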

We use the term “streaming analytics engine” to characterize streaming platforms with the ability to perform analytic operations (defined later in this chapter) and without a storage capability. Thus, a streaming analytics engine accepts streaming input, performs an operation, and passes the result to some other application for storage or use.

The ability to perform analytic operations distinguishes a streaming analytics engine from a streaming data source, which simply collects and forwards streaming data. However, we expect the two categories to converge, as popular streaming data sources add analytics capabilities.

Databases can also have streaming capabilities. However, in a 2004 paper,32 MIT’s Michael Stonebraker detailed the important differences between a relational database and a purpose-built streaming engine. Stonebraker’s team compared throughput for a workflow with 22 operators on StreamBase and on a leading relational database. StreamBase processed data at the rate of 160,000 records per second; by comparison, the team could achieve only 900 records per second with the relational database.

Stonebraker attributed the extreme performance difference to the relational database’s mandatory storage operation for incoming records. StreamBase made the initial storage optional, so that incoming records could be processed in memory, then either stored or passed to another application. In short, the sequence of operations matters; a database that stores data first and then performs calculations will appear slower than an in-memory engine that simply performs calculations and passes data to another application for storage. Of course, the total processing time from data receipt to storage matters; but for some applications, it makes sense to perform calculations first and store the data in the background.

Databases have evolved a great deal since 2004; many today permit in-memory pre-processing before storage. Moreover, the growth of in-memory databases, covered in Chapter Five, tends to blur the distinction between pure streaming engines and databases. Startups MemSQL and VoltDB, for example, position their in-memory databases in the market for streaming analytics.

Streaming Operations

A conventional relational database processes a query until it reaches end-of-table indicators for all tables referenced in the query. In a streaming database, there is no equivalent end-of-table concept, since the database updates continually as new records arrive.

Many of the operations analysts seek to perform on streams are the same as those they perform on static tables. Some, however, take distinctive forms with streaming data: joining streams, aggregations, filtering, windowing, and alerts.

Joining Streams. For insight, analysts may need to join multiple streams to one another. For example, a vehicle fleet operator may have streams of data arriving for each vehicle in the fleet; to monitor fleet-level statistics, all of the individual streams must be joined into a single stream.

For context, analysts may also need to join streams to static tables. For example, suppose that we are working with a stream of transactions posted by hundreds of retail stores, which we want to group by region. The transaction records in the incoming stream have a store code but not a region code; for that, we must join the stream to a static store master table to capture the region code.

Aggregations. A key capability of streaming SQL is the ability to compute and retain aggregates on the incoming stream. For example, we may want to compute a cumulative count and sum of transactions as they arrive. Aggregations are most useful when combined with windowing, so that the computed measures correspond to statistics for discrete and finite time intervals.

Filtering. Filtering a stream of data is conceptually similar to filtering a static data set. We filter for two reasons:

  • To remove noise and irrelevant data.

  • To limit the scope of the analysis to a specific subset of the data.

In the first case, the stream of data may include test records, incomplete transactions, miscoded data, or other kinds of “garbage” that simply adds noise to our analysis.

Business questions rarely require the entire universe of data available. Instead, we typically seek information about specific products, stores, people, customers, geographies, and time periods (or complex combinations of all of these attributes).

Windowing. Streaming data arrives continuously at arbitrary time intervals. For insight and analysis, however, end users want to see statistics for fixed time intervals: seconds, minutes, hours, or some other interval. Windowing functions enable users to define a time period, or window, and the data to include in the window. Analysts can use statistics aggregated through windowing for cumulative totals, moving averages, and other more sophisticated analysis.
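
A minimal Spark Streaming sketch ties several of these operations together: it filters an incoming stream of transactions, joins the stream to a static store master table for region codes, and computes a windowed aggregate by region. The socket source, record layout, and window sizes are illustrative assumptions, not part of any particular product.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingOperationsSketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches
ssc.checkpoint("/tmp/streaming-checkpoint")   # required for windowed state

# Static lookup table: store code -> region code (assumed layout).
store_master = sc.parallelize([("S001", "EAST"), ("S002", "WEST")])

# Assumed input: lines of "store_code,amount" arriving on a socket.
lines = ssc.socketTextStream("localhost", 9999)
transactions = lines.map(lambda line: line.split(",")) \
                    .map(lambda fields: (fields[0], float(fields[1])))

# Filtering: drop test stores and non-positive amounts ("garbage").
clean = transactions.filter(lambda t: t[1] > 0 and not t[0].startswith("TEST"))

# Join the stream to the static table for context, then re-key by region.
by_region = clean.transform(lambda rdd: rdd.join(store_master)) \
                 .map(lambda kv: (kv[1][1], kv[1][0]))  # (region, amount)

# Windowed aggregation: total sales per region over the last 60 seconds,
# recomputed every 10 seconds.
totals = by_region.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add values entering the window
    lambda a, b: a - b,   # subtract values leaving the window
    windowDuration=60,
    slideDuration=10,
)
totals.pprint()

ssc.start()
ssc.awaitTermination()
```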

Alerts. Alerts are arguably the most important streaming operator. Streaming analytics theory holds that (a) information becomes less valuable as it ages, and (b) information is valuable only if it is actionable. Thus, defining and tuning alerts is a central task for any streaming analytics system.

There are three kinds of rule-based alerts, in increasing complexity:

  • Alerts based on fixed rules, universally applied. For example, “select all transactions with an amount greater than $1,000”.

  • Alerts based on rules that are differentiated by groups or entities within the population. For example, “select all transactions greater than two times the mean transaction value for this customer”.

  • Alerts based on rules that are differentiated by groups or entities over time. For example, “select all transactions greater than two times the mean transaction value for this customer in the past month”.
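
The three rule types can be expressed as simple predicates. The sketch below assumes that each transaction carries a customer identifier and an amount, and that per-customer statistics are available from a profile store maintained elsewhere; both are assumptions for illustration.

```python
from typing import Dict

# Hypothetical per-customer profiles, e.g. maintained by a batch job or a
# windowed aggregation: an overall mean and a trailing 30-day mean.
profiles: Dict[str, Dict[str, float]] = {
    "C123": {"mean_amount": 80.0, "mean_amount_30d": 40.0},
}

def fixed_rule(txn: dict) -> bool:
    """Fixed rule, universally applied."""
    return txn["amount"] > 1000.0

def per_customer_rule(txn: dict) -> bool:
    """Rule differentiated by entity within the population."""
    profile = profiles.get(txn["customer_id"], {})
    return txn["amount"] > 2 * profile.get("mean_amount", float("inf"))

def per_customer_time_rule(txn: dict) -> bool:
    """Rule differentiated by entity and by time (trailing month)."""
    profile = profiles.get(txn["customer_id"], {})
    return txn["amount"] > 2 * profile.get("mean_amount_30d", float("inf"))

txn = {"customer_id": "C123", "amount": 250.0}
alerts = [rule.__name__
          for rule in (fixed_rule, per_customer_rule, per_customer_time_rule)
          if rule(txn)]
print(alerts)  # ['per_customer_rule', 'per_customer_time_rule']
```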

In any streaming system, defining and tuning alerts require careful balancing of false positives and negatives. Alerts drive actions, such as fraud investigations, transaction declines, or tax audits; some actions are expensive, or they can adversely impact customer goodwill.

False positives are like false alarms: the system focuses attention on a transaction that, upon further investigation, requires no action. False negatives are lost opportunities; the system fails to focus attention on a transaction that should have been acted upon.

If we define alerts too narrowly, the system produces too few alerts, and there are many false negatives. In a credit card fraud detection system, for example, the result will be high fraud losses.

On the other hand, if we define alerts too broadly, the system produces too many alerts, many of which are false alarms; the system loses credibility with its users. In a credit card fraud detection system, the result will be angry customers and overworked investigators.

In practice, alerts based on streaming data must be carefully designed, developed, tested, validated, tuned, and monitored. These needs drive many of the supporting tools and capabilities of commercial streaming analytics platforms.

Events in a stream rarely provide useful insight in isolation; we need context to distinguish important events. Consider the following example:

  • At 11:41 am on Saturday, John Doe presents his credit card at a particular store in Chicago for a $500 purchase.

If we are interested in detecting credit card fraud, this transaction alone tells us very little. But now consider the following context:

  • In the past 12 months, a high percentage of the credit card transactions at this store were fraudulent.

  • For stores of this type, transactions of more than $100 have a higher incidence of fraud.

Adding some historical information about the merchant and merchant category raises our concern about the transaction. Checking Mr. Doe’s profile, we discover that he lives in Philadelphia, which might be enough to trigger a request to the store cashier to verify the customer’s identity. Now consider the following additional fact:

  • At 11:01 am today, Mr. Doe presented the same card at a gas station in Philadelphia.

Since it is not possible for a customer to present the same credit card in two widely separated cities just 40 minutes apart, we now know that at least one of the cards is fraudulent, and we can take further action.

The example demonstrates a key principle of streaming analytics: for insight, we must combine a streaming fact with other streaming facts and with stored contextual data. Operations on individual streaming facts alone produce trivial results.
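
A minimal sketch of the velocity check described above keeps the last-seen location for each card as stored context and flags any transaction that implies an impossible travel speed. The city coordinates, speed threshold, and record layout are illustrative assumptions.

```python
import math
from datetime import datetime

# Stored context per card: (city, latitude, longitude, timestamp of last use).
last_seen = {
    "4111-xxxx": ("Philadelphia", 39.95, -75.17, datetime(2016, 5, 14, 11, 1)),
}

def distance_miles(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance (haversine formula)."""
    r = 3959.0  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def velocity_alert(card, city, lat, lon, ts, max_mph=600.0):
    """Flag the transaction if the implied travel speed is implausible."""
    prev = last_seen.get(card)
    last_seen[card] = (city, lat, lon, ts)   # update the stored context
    if prev is None:
        return False
    _, plat, plon, pts = prev
    hours = max((ts - pts).total_seconds() / 3600.0, 1e-6)
    return distance_miles(plat, plon, lat, lon) / hours > max_mph

# The Chicago purchase, 40 minutes after the Philadelphia gas station visit:
print(velocity_alert("4111-xxxx", "Chicago", 41.88, -87.63,
                     datetime(2016, 5, 14, 11, 41)))  # True
```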

In a related example, suppose that we are interested in customer loyalty and retention for an online bank. Every day for the past three years, Jane Doe has logged in to check her balance. On March 1, she does not log in. Like Sherlock Holmes’ dog that did not bark33, it is the absence of a transaction that matters, and we are only able to understand this because we combine streaming facts with history.

We cite these two examples to demonstrate that for most business analytics, streaming data cannot be separated from historical data; both must be used together. Reflecting this key point, tools for the analysis of streaming data and static data are converging; instead of distinct tooling for streaming data, we see streaming operators within analysis tools that can work with both types of data.

Complex Event Processing

We noted earlier in this chapter, in the historical survey, that Complex Event Processing (CEP) emerged in the 1990s. The streaming analytics startups founded at that time generally featured CEP as an organizing principle for interactions with users.

Gartner defines CEP as:

A kind of computing in which incoming data about events is distilled into more useful, higher level “complex” event data that provides insight into what is happening. CEP is event-driven because the computation is triggered by the receipt of event data. CEP is used for highly demanding, continuous-intelligence applications that enhance situation awareness and support real-time analytics. 34

Interest in CEP peaked around 2010, as shown in Figure 6-1.

Figure 6-1. Google search interest in CEP (Source: Google Trends)

CEP is an analytic framework that enables inference of high-level patterns or events from multiple streaming data sources. For example, in capital markets a trader might seek to use CEP to infer buy and sell signals for a security from streaming news feeds, text messages, social media, market feeds, weather reports, and other data, collectively called an event cloud.

Suppose, for example, that we have a large number of sensors mounted on a Formula One racing car that measure such things as oil pressure, water pressure, exhaust particulates, and power output every fraction of a second. We want to detect a blown engine as early as possible, so we can automatically shut down other systems to avoid damage and inform the driver to steer the car to safety.

From analysis, we know that a rapid drop in oil pressure and water pressure, combined with reduced engine power output and increased particulates in the exhaust, means that the engine is blown. With CEP, we can model those relationships; then, we can use CEP software to build an automatic shutdown procedure using data streaming from the sensors.
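
The blown-engine pattern can be sketched as a rule evaluated over a short sliding window of sensor readings. The thresholds, window length, and reading format below are illustrative assumptions, not anything prescribed by a particular CEP product.

```python
from collections import deque

WINDOW = 10  # number of recent readings to retain (e.g., one second of data)
readings = deque(maxlen=WINDOW)

def engine_blown(window) -> bool:
    """Infer the high-level 'blown engine' event from low-level readings."""
    if len(window) < 2:
        return False
    first, last = window[0], window[-1]
    oil_drop = last["oil_psi"] < 0.5 * first["oil_psi"]
    water_drop = last["water_psi"] < 0.5 * first["water_psi"]
    power_drop = last["power_kw"] < 0.7 * first["power_kw"]
    particulates_up = last["particulates"] > 2.0 * first["particulates"]
    return oil_drop and water_drop and power_drop and particulates_up

def shutdown_systems() -> None:
    print("shutting down dependent systems")   # hypothetical actuator call

def alert_driver() -> None:
    print("alerting driver to pull over")      # hypothetical notification

def on_reading(reading: dict) -> None:
    """Called for every new sensor reading arriving on the stream."""
    readings.append(reading)
    if engine_blown(readings):
        shutdown_systems()
        alert_driver()
```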

CEP is not an algorithm, but a conceptual model for a class of problems where useful patterns require combining information across different sources of streaming data, and where events of interest are defined in time. CEP does not tell the user what the relationships are among events; it simply allows the user to describe them. In this respect, CEP is comparable to SQL; it enables the user to express a pattern and generate data accordingly, but does not help the user discover patterns.

In theory, analysts can use machine learning to infer relationships between high-level events and detailed data. In practice, most existing applications depend more on rule-based inference, because rules are easier for business users to understand.

Streaming Machine Learning

More often than not, when managers speak about machine learning with streaming data, they mean scoring with streaming data. In other words, they want to use a predictive model trained with static data to generate predictions with streaming data. In low latency scoring, the model itself remains stable; we simply seek to apply a stable model to produce a prediction with minimal latency. Organizations use such predictions in a variety of automated decisions, such as scoring live credit card transactions for fraud risk.

Most streaming engines and low latency decision engines can ingest PMML models, and can also support predictive model pipelines as custom code. From the model developer’s perspective, training predictive models that will be deployed with streaming data is no different from any other deployment scenario, although the model developer may need to be mindful of what data will actually be available in a low latency environment.
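
A minimal sketch of low-latency scoring with a model trained offline appears below. PMML is the common interchange format mentioned above; for illustration, this sketch simply loads a pickled scikit-learn classifier and applies it to transactions as they arrive. The file name, feature layout, and downstream action are assumptions.

```python
import joblib

# Model trained offline in a batch environment and serialized to disk.
model = joblib.load("fraud_model.pkl")   # assumed artifact name

def score(txn: dict) -> float:
    """Return the fraud probability for a single streaming transaction."""
    features = [[txn["amount"], txn["merchant_risk"], txn["hour_of_day"]]]
    return float(model.predict_proba(features)[0][1])

def handle(txn: dict, threshold: float = 0.9) -> None:
    # The model itself is stable; only the incoming data changes.
    if score(txn) >= threshold:
        route_to_investigation(txn)   # hypothetical downstream action

def route_to_investigation(txn: dict) -> None:
    print("flagged for review:", txn)
```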

Evolutionary machine learning algorithms that learn continuously from streaming data represent an entirely different type of model. An evolutionary model is appropriate when the process we seek to model is not stable over time. Such algorithms are rarely, if ever, used to support business decisions, due to legal, regulatory, and management concerns. Actual applications at present tend to be limited to experimental use cases, or where the analytics will be used for insight and discovery.

The two key issues for evolutionary models are data recency and the time window. By recency, we mean how quickly new transactions enter the model training data and the model itself adjusts to the new observation. The time window of the model is the amount of history included in the training data; models can work with a sliding window (such as the last 24 months of data), a fixed window (all data captured after January 1, 2014), or all data ever captured.

Evolutionary models with very long time windows produce results that are very similar to static models. Static models that are updated frequently produce results that are very similar to evolutionary models unless the process we seek to model is highly unstable—in which case the value of predictive modeling itself is called into question.

On paper, at least, there are incremental versions of support vector machines,35 neural networks,36 and Bayesian networks.37 There are actual implementations of incremental versions of k-means clustering38 and linear regression39 in Apache Spark.
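
Spark’s streaming k-means implementation, for example, updates cluster centers as each micro-batch arrives; a decay factor controls how quickly older data is forgotten, which governs the recency and effective time window discussed above. A minimal sketch follows; the socket source and two-feature record layout are assumptions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="StreamingKMeansSketch")
ssc = StreamingContext(sc, batchDuration=5)

# Assumed input: lines of two comma-separated numeric features on a socket.
points = ssc.socketTextStream("localhost", 9999) \
            .map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))

# decayFactor < 1.0 discounts older batches, so the clusters evolve as the
# underlying process drifts; decayFactor = 1.0 weights all history equally.
model = StreamingKMeans(k=3, decayFactor=0.5) \
    .setRandomCenters(dim=2, weight=1.0, seed=42)

model.trainOn(points)              # update cluster centers on each batch
model.predictOn(points).pprint()   # emit cluster assignments

ssc.start()
ssc.awaitTermination()
```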

Anomaly Detection

Anomaly detection is the identification of items in a stream that do not conform to an expected pattern or to other items in the stream. Technically, anomaly detection is not limited to stream processing and can be applied to batches of data as well. In practice, however, organizations use anomaly detection for low-latency applications, such as network security, where the goal is to evaluate events as soon as possible.

Like CEP, anomaly detection is neither a precisely defined technique nor a specific use case; it is a generic application that can be applied to a broad range of business problems, including fraud detection, health care quality assurance, operations management, and security applications. Machine learning techniques suitable for anomaly detection include:

  • Supervised learning techniques, where anomalous events are well-defined and we have a set of examples we can use to train a model.

  • Unsupervised learning techniques, where we cannot define anomalies in advance and simply seek to identify cases that are different.

Practitioners have successfully used k-nearest neighbor, support vector machines, neural networks, clustering, association rules, and ensemble models to build anomaly detection applications. Anomaly detection systems based on unsupervised learning generate alerts and route them to human analysts for investigation and disposition.
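
As a simple unsupervised illustration, a running z-score over a sliding window flags values that deviate sharply from recent history; the window length and threshold below are assumptions, and production systems typically rely on the richer techniques listed above.

```python
from collections import deque
from statistics import mean, stdev

class ZScoreDetector:
    """Flag values more than `threshold` standard deviations away from the
    mean of the most recent `window` observations."""

    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 30:  # wait for a minimal baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)
        return anomalous

detector = ZScoreDetector()
for value in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11] * 5 + [95]:
    if detector.is_anomaly(value):
        print("anomaly:", value)   # in practice, routed to an analyst queue
```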

Streaming Data Sources

The streaming data platforms detailed in this section differ from the low-latency analytics platforms covered later in this chapter because they lack analytics operators. Instead, they are designed to serve as brokers between source systems that generate data and analytic systems that consume data.

We note that the distinction between streaming data sources and streaming engines may blur in the future, as popular data sources add analytics capability. Amazon Web Services has announced Kinesis Analytics (planned availability is late 2016).

Apache ActiveMQ

Apache ActiveMQ is a Java-based open source message broker. ActiveMQ entered Apache incubation in December 2005 and graduated to top-level status in February 2007. ActiveMQ is widely used and embedded in at least 16 other Apache projects.

Red Hat distributes JBoss A-MQ , a commercially supported version of ActiveMQ. Several other companies offer training, consulting, and support for the generic open source version.

ActiveMQ’s code base expanded40 rapidly until late 2010 and has grown slowly since then.

Apache Kafka

Apache Kafka is an open source message-brokering software project. A team at LinkedIn developed the original code; LinkedIn contributed the code to the Apache Software Foundation in 2011. Kafka graduated to top-level Apache status in October 2012.

Kafka offers very high throughput for streaming data. Individual Kafka servers (“brokers”) can handle hundreds of megabytes of reads and writes per second from thousands of sources.

Kafka has a distributed scale-out architecture for durability and fault-tolerance. The software saves and replicates messages within the cluster, so it can reconstruct messages if a node fails.

A single Kafka cluster can serve as the central data backbone for a large organization. The cluster can be expanded and contracted without downtime. Kafka partitions data streams and spreads them over a cluster of machines to support data streams that are too large for any single machine to handle.
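
A minimal producer sketch using the kafka-python client follows; the broker address, topic, and event layout are assumptions. Messages sharing a key land on the same partition, which preserves per-key ordering as the stream is spread across the cluster.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "u42", "page": "/pricing", "ts": "2016-05-14T11:41:00Z"}

# Keying by user_id keeps each user's events in order on a single partition.
producer.send("clickstream", key=event["user_id"], value=event)
producer.flush()   # block until buffered messages have been delivered
```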

Some of the ways that organizations use41 Kafka include:

  • LinkedIn uses Kafka to handle activity stream data and operational metrics and to power products such as LinkedIn Newsfeed and LinkedIn Today.

  • Netflix uses42 Kafka for its low-latency event processing pipeline.

  • Spotify uses Kafka to ingest 20 terabytes of data daily as part of a log delivery system.

  • Square uses Kafka to move systems events, including metrics and logs, through a low-latency pipeline to consuming systems like Splunk and Graphite.

  • Cisco’s OpenSOC43 project seeks to develop an extensible and scalable advanced security analytics tool. OpenSOC uses Kafka to collect streaming data from traffic replicators and telemetry sources and pass the data along to Apache Storm for low-latency analytics.

  • Retention Science uses Kafka to collect and handle clickstream data.

Kafka has a relatively small code base44, but contributions have accelerated markedly since mid-2015.

Confluent, a startup founded in 2014, leads Kafka development and offers training, consulting, and commercial support.

Amazon Kinesis

Amazon Kinesis is an Amazon Web Services (AWS) platform for streaming data. The service enables users to load and analyze streaming data and to build streaming applications. The platform includes three services:

  • Amazon Kinesis Firehose, a basic service for handling streaming data.

  • Amazon Kinesis Analytics, a SQL service for streaming data (planned availability is late 2016).

  • Amazon Kinesis Streams, a service for building applications that handle streaming data.

Amazon Kinesis Firehose captures and automatically loads streaming data into Amazon S3 and Amazon Redshift. AWS offers Firehose as a managed service that scales up and down automatically to handle variation in data volumes with consistent throughput.

Amazon Kinesis Streams enables users to build custom applications that continuously capture and process data from sources such as web site clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events. AWS offers a library of prebuilt applications for common tasks such as building low-latency dashboards, alerts, dynamic pricing, and so forth. Users can also transmit from Kinesis to other AWS services such as S3, Redshift, Amazon Elastic MapReduce (EMR), and AWS Lambda.
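
A minimal sketch of writing a record to a Kinesis stream with boto3, the AWS SDK for Python, appears below; the stream name, region, and record layout are assumptions. Records that share a partition key are routed to the same shard, which preserves their ordering.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"vehicle_id": "V-100", "lat": 41.88, "lon": -87.63, "speed_mph": 34}

kinesis.put_record(
    StreamName="vehicle-telemetry",           # assumed stream name
    Data=json.dumps(event).encode("utf-8"),   # record payload
    PartitionKey=event["vehicle_id"],         # routes the record to a shard
)
```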

RabbitMQ

RabbitMQ is an open source message broker that implements a standard called the Advanced Message Queuing Protocol (AMQP) on the Open Telecom Platform (OTP) clustering framework. Rabbit Technologies, Limited, a subsidiary of Pivotal Software, leads development and provides commercial support.

RabbitMQ has an active list of contributors45 and a gradually expanding code base.
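
A minimal publish sketch using pika, the standard Python client for AMQP, follows; the queue name and message body are assumptions.

```python
import pika

# Connect to a broker assumed to be running locally on the default port.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declaring the queue is idempotent; it is created if it does not exist.
channel.queue_declare(queue="sensor-readings", durable=True)

channel.basic_publish(
    exchange="",                       # default exchange routes by queue name
    routing_key="sensor-readings",
    body='{"sensor": "S-7", "temp_c": 21.4}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```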

Streaming Analytics Platforms

As of early 2016, there are five Apache open source projects that support streaming analytics: Apache Apex, Apache Flink, Apache Samza, Apache Spark (Streaming), and Apache Storm. Apex, Flink, Samza, and Storm are “pure” streaming engines, while Spark is a general-purpose analytics platform with streaming capabilities. The number of projects reflects the rapidly growing interest in streaming among prospective users and contributors.

There are also streaming tools available in R, Python, and other open source libraries. We focus on the Apache projects because they all support fault-tolerant distributed computing.

Apache Apex

Apache Apex is the open source version of a streaming and batch engine originally developed by DataTorrent, a commercial venture founded in 2015. Most of the code commits to Apex come from DataTorrent employees. Apex entered Apache incubator status in August 2015 and graduated46 to top-level status in April 2016.

To supplement core Apex, its developers created the Malhar library, which includes operators that implement common business logic functions needed by customers who want to quickly develop applications. These include:

  • Access to file systems, including HDFS, S3, and NFS.

  • Integration with message brokers, including Kafka, ActiveMQ, RabbitMQ, JMS, and other systems.

  • Database access, including connectors to MySQL, Cassandra, MongoDB, Redis, HBase, CouchDB, and other databases, along with JDBC connectors.

The Malhar library also includes common business logic patterns that help users reduce the time it takes to go into production.

In its proposal to the Apache Software Foundation, Apex’s sponsors differentiate the project by arguing that applications written for non-Hadoop platforms typically require major rewrites to get them to work with Hadoop.

This rewriting creates a significant bottleneck in terms of resources (expertise), which in turn jeopardizes the viability of such an endeavor. It is hard enough to acquire Big Data expertise; demanding additional expertise to do a major code conversion makes it a very hard problem for projects to successfully migrate to Hadoop. 47

The Apex team reports no current production users. After growing rapidly until early 2014, the code base48 has been largely static since then.

Apache Flink

Apache Flink is an open source distributed dataflow engine written in Scala and Java. Flink’s runtime supports batch and stream processing, as well as iterative algorithms.

Dataflow programming is an approach that models a program as a directed graph of the data flowing between operations, emphasizing data movement and the connections between operations rather than a sequence of instructions.

Flink does not have a storage system of its own, so input data must be stored in a system like HDFS or HBase, or it must come from a messaging system like Apache Kafka.

Researchers at the Technical University of Berlin, Humboldt University of Berlin, and the Hasso-Plattner Institute started a collaborative project called Stratosphere49 in 2010. After three Stratosphere releases, the consortium donated a fork of the project to the Apache Software Foundation , which accepted the project for Incubator status in March 2014. In December 2014, Flink graduated to top-level status.

Some of the ways that organizations use Flink include:

  • Capital One uses Flink for low-latency customer activity monitoring.

  • Bouygues Telecom uses Flink for low-latency event processing and analytics.

  • ResearchGate uses Flink for network analysis and duplicate detection.

Other applications include semantic Big Data analysis and inference for tax assessment and research on distributed graph analytics.

Flink includes several modular libraries, including:

  • Gelly, a Graph API for Flink, with utilities that simplify the development of graph analysis applications.

  • ML, a machine learning library.

  • Table, an API that supports SQL-like expressions.

DataArtisans, a startup based in Berlin, Germany, leads development, provides commercial support, and organizes the Flink Forward conference. Flink is not currently supported in any commercial Hadoop distribution.

Flink’s code base50 grew steadily until mid-2015, but has been largely flat since then.

Apache Samza

Apache Samza is a computational framework that offers fault-tolerant, durable, and scalable stateful stream processing with a simple API. It uses Kafka for messaging and runs under YARN. A team at LinkedIn developed Samza together with Kafka to support stream processing use cases. LinkedIn donated51 the project to open source in 2013; it entered Apache incubator status in July 2013, and graduated to top level status in January 2015.

Some of the ways that organizations use Samza include:

  • LinkedIn uses Samza to process tracking and service log data and for data ingestion pipelines.

  • Intuit uses Samza to enrich events with more contextual data from various sources to aid operations personnel.

  • Metamarkets uses Samza to transform and join low-latency event streams for interactive querying.

  • Uber uses Samza to aggregate metrics, database updates, fraud detection, and root cause analysis.

  • Netflix uses Samza to route more than 700 billion events per day from Kafka to Hive.

Other applications include low-latency analytics, multi-channel notification, security event log processing, low-latency monitoring of data streams from wearable sensors for healthcare management, and social media analysis.

Samza is not currently supported by any of the Hadoop vendors. Contributor activity52 is low.

Apache Spark Streaming

We discussed Apache Spark’s streaming capabilities in Chapter Five, under in-memory analytics.

Apache Storm

Apache Storm is an open source low-latency computing system. A team at a startup named BackType wrote Storm in the Clojure programming language. When Twitter acquired BackType, it released Storm as an open source project. Storm entered Apache incubator status in September 2013 and graduated to top-level status in September 2014.

Storm applications express data transformations as a directed acyclic graph, where the vertices or nodes represent data sources or transformations and the edges represent streams of data flowing from one vertex to the next. Nodes that represent streaming data sources are called spouts; nodes that process one or more input streams and produce one or more output streams are called bolts; the complete network of spouts, bolts, and streams is called a topology.
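
The sketch below illustrates the spout/bolt/topology model in plain Python; it is a conceptual illustration only, not Storm’s actual API, which is JVM-based. In Storm, each node in the graph would run as many parallel tasks distributed across the cluster.

```python
from typing import Dict, Iterable, List

def sentence_spout() -> Iterable[str]:
    """Spout: emits a stream of tuples (here, sentences)."""
    for sentence in ["the quick brown fox", "jumps over the lazy dog"]:
        yield sentence

def split_bolt(sentence: str) -> List[str]:
    """Bolt: consumes one input tuple and emits zero or more output tuples."""
    return sentence.split()

counts: Dict[str, int] = {}

def count_bolt(word: str) -> None:
    """Terminal bolt: maintains running word counts."""
    counts[word] = counts.get(word, 0) + 1

# "Topology": wire the spout to the bolts. Here everything runs sequentially
# in one process; Storm would parallelize each node across machines.
for sentence in sentence_spout():
    for word in split_bolt(sentence):
        count_bolt(word)

print(counts)
```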

Storm’s messaging interface is sufficiently flexible that it can be integrated with any streaming data source. Well-documented queue integrations currently include Kestrel, RabbitMQ, Kafka, JMS, and Amazon Kinesis.

Topologies are inherently parallel and run across a cluster of machines.53 Different parts of the topology can be scaled individually by manipulating their parallelism. Storm’s inherent parallelism means it can process very high throughputs of messages with very low latency.

Storm is fault-tolerant: if one or more processes fail, Storm will automatically restart them. If a process fails repeatedly, Storm will reroute it to another machine and restart it there.

Under its standard configuration, Storm guarantees “at-least-once” processing, which ensures that every incoming message will be processed. For exactly-once processing—a key requirement in financial systems—Storm supports an overlay application called Trident.

Once started, Storm applications run indefinitely and process data in low latency as it arrives.

Some of the ways that organizations use54 Storm include:

  • Groupon uses Storm to build low-latency data integration systems that can analyze, clean, normalize, and resolve large quantities of data.

  • Spotify uses Storm for low-latency music recommendations, monitoring, analytics, and ad targeting.

  • Cerner uses Storm to process clinical data in low latency.

  • Taobao uses Storm to extract useful information from machine logs.

  • TheLadders uses Storm to send hiring alerts; when a recruiter posts a job, Storm processes the event and aggregates job seekers who match the required profile.

Other applications include synchronizing contact lists, systems monitoring, trending topic detection, sentiment analysis of social media, security monitoring, and many others.

Hortonworks and MapR distribute and support Storm. Code contributions55 have accelerated markedly since mid-2015.

Streaming Analytics in Action

Organizations currently use streaming analytics for a variety of use cases, including risk management, telco operations, basic science, and medical research, among others. Here are seven examples.

Credit Card Fraud.56 Shopify offers an ecommerce platform as a service for online stores and retail point-of-sale systems. As of late 2015, the company reports that its platform supports more than 200,000 merchants and $12 billion in gross sales at a rate of 14,000 events per second. Processing credit card transactions is risky, and the company has just seven analysts to investigate possible fraud. Shopify processes transactions from Apache Kafka in Spark Streaming, filtering the riskiest transactions and routing them to case management software for investigation.

Credit Card Operations.57 To detect unusual behavior, ING clusters venues (stores) according to usage patterns, then monitors the stream of transactions with Spark Streaming to identify venues whose cluster assignment changes. Analysts investigate anomalies to determine causes of the unusual behavior.

Customer Experience Management.58 Capital One, a leading U.S. consumer and commercial banking institution, has an overall technology strategy that seeks to shift data processing from batch operations to stream processing. In its digital operations, the bank monitors customer activity in low latency to proactively detect and resolve issues, prevent systems issues from adversely impacting the customer, and to enable a flawless digital experience. Capital One previously used expensive proprietary tools that offered limited capabilities for low-latency advanced analytics. The bank developed a new system that uses Apache Flink to process events from Apache Kafka and generate low-latency alerts, time-window aggregates, and other operations. Flink also provides the bank with the ability to perform advanced windowing, event correlation, fraud detection, event clustering, anomaly detection, and user session analysis.

Telco Network Monitoring.59 Bouygues Telecom is one of the largest communications providers in France, with more than 11 million mobile subscribers, 2.5 million fixed line customers, and revenue of more than 5 billion euros. Bouygues’ Logged User Experience (LUX) system captures massive quantities of log data from network equipment and generates low-latency diagnostics and alerts. Flink collects log events from Apache Kafka at an average rate of 20,000 events per second, transforms the raw data into a usable and enriched format, and returns it to Kafka for additional handling. Flink also generates alarms if it detects failures exceeding a threshold.

SK Telecom, South Korea’s largest wireless carrier, offers60 another example. SKT captures 250 terabytes of network logs per day, which it loads into a Hadoop cluster that now has more than 1,400 nodes. For low-latency analytics, the company uses Spark Streaming to capture events from Kafka and produce low-latency metrics of network utilization, quality, and fault analysis.

Neuroscience.61 The Freeman Lab at Howard Hughes Medical Institute’s Janelia Research Campus explores neural computation in behaving animals at the scale of large populations and entire brains. To facilitate this work, the lab has developed three open source packages: thunder62, a large-scale imaging and time series analysis tool that runs on Apache Spark; lightning63, a tool that produces web visualizations; and binder64, software for reproducible computing with Jupyter and Kubernetes. Zebrafish brains have about 100,000 neurons (compared to human brains, which have 100 billion neurons). A scanning electron microscope working with a zebrafish brain produces two terabytes of data an hour. To measure response to various stimuli, researchers at the lab use low-latency analytics to capture key metrics, perform dimension reduction, clustering, and regression analysis, with results piped into visualization tools.

Medical Research.65 In partnership with Intel and a consortium of hospitals, rehabilitation clinics, and clinical trial providers, the Michael J. Fox Foundation conducts research into Parkinson’s Disease. A key challenge for researchers is a lack of objective measurements of the physical symptoms of the disease, including tremors; these symptoms progress slowly, and changes are hard to detect. With wearable devices and a stack of open source components that includes Apache Kafka and Apache Spark Streaming, scientists can monitor patient activity, symptoms, and sleep patterns, and they can correlate these with medication intake.

Streaming Economics

How important is low latency in analytics? In this and previous chapters, we identified two proven examples of disruptive analytics where reducing latency was a critical success factor:

  • In financial markets, trading algorithms operate in a Darwinistic world, where microseconds matter.

  • In fraud detection, credit card issuers must detect fraud within a narrow window of time or absorb the loss.

We also presented examples of streaming analytics at work in a number of fields: credit card fraud detection, credit card operations, customer interaction management, medical research, and telco operations. Most of the examples are operational, not strategic, and none of the applications is disruptive, as defined in Chapter One, with the possible exception of the medical research example, which could have a profound impact on research into Parkinson’s Disease.

Overall, the economic impact of streaming analytics is very small outside of some clear-cut use cases. At the beginning of this chapter, we noted that spending on software for streaming analytics is only about 1% of total spending on software for business analytics. Spending on streaming analytics is growing faster than for the category as a whole, but even at the most optimistic projection, it’s not likely to account for more than 2% of the category in 2020. That is a niche market.

TIBCO, with its focus on a broad range of tools to reduce latency, was able to build a billion dollar business over a decade. Other players in the market were not so successful:

  • When Sybase acquired Aleri in 2010, it acquired “certain assets” of the company, and not the company as a whole—which implies that Aleri was failing and no longer a going concern.

  • Progress Software acquired APAMA for $25 million in 2005 and sold it in 2013 for $44 million, a modest gain. But revenue from the product line declined by 61% in the two years prior to the sale. Keep in mind that APAMA was and is the market share leader in the category.

  • When TIBCO acquired StreamBase in 2013, the purchase price of $49 million barely covered StreamBase’s total capitalization of $44 million.

In other words, among the companies that entered the streaming analytics market prior to 2005, there were no unicorns.

Even TIBCO, when it sold66 itself to a private equity buyer in 2014, did so at a valuation of about four times revenue, a valuation appropriate for a mature company with limited growth prospects. At the time of the transaction, TIBCO’s revenue had declined for more than a year, a problem TIBCO’s CEO attributed to a transition to subscription pricing67—a sure sign of disruption.

The real-time analytics market also witnessed one of the greatest examples of hype-driven bubbles in the case of CRM provider E.piphany Inc. E.piphany, founded in 1996, went public in September 1999, trading at $16. Two months later, the stock traded at $80.

In November 1999, E.piphany announced that it was acquiring real-time customer interaction management company Rightpoint for $392 million. Investors went wild, bidding the stock up to $158.

Four months later, with the stock trading at $317, E.piphany announced that it was acquiring68 Octane, another real-time customer interaction management provider, for $3.2 billion, 91 times projected revenue.

E.piphany’s revenue peaked in 2000, then declined precipitously. By 2004, the stock traded at $3.50. SSA Global, an ERP vendor, acquired69 the company for $4.20 a share in 2005.

The business success of companies in the streaming and real-time analytics market is only a proxy measure of a technology’s impact on the analytics value chain. However, when most companies in the category struggle to drive value, the implication is that there is little value to be driven.

The past, of course, does not necessarily foretell the future. TIBCO’s struggles beginning around 2012 suggest rapid adoption of cloud and open source streaming technologies, whose business models may make streaming attractive for new use cases.

Streaming advocates are excited about its potential for the Internet of Things (IoT): the network of objects and devices, including vehicles, machines, and buildings, embedded with sensors and connected to other devices. Examples include smart grids, smart vehicles, smart homes, and so forth. Connected devices generate huge volumes of streaming data.

We note that IoT is at the peak70 of its hype cycle as of this writing.

Managers must distinguish between the costs of streaming data processing, on the one hand, and the benefits of reduced latency, on the other.

For human BI and interactive query workloads, latency measured in seconds is more than adequate; few human analysts tracking events through a BI system can benefit from lower latency than that.71 Moreover, it is doubtful that any company ever lost money because junior program analysts had to wait a few minutes to view the results of a creative test.

The real potential for streaming analytics is in automated processes, where streaming engines can make decisions in a consistent manner, with much higher throughput and lower latency than humans can possibly deliver. Automation with streaming analytics will yield the highest economic benefits when applied to repeatable operations that are highly labor intensive, in operations where humans perform poorly, or in operations that are impossible for humans to perform.

Footnotes

14 Renamed the Defense Advanced Research Projects Agency in March 1996.

36 ftp://ftp.sas.com/pub/neural/FAQ2.html#A_styles_batch_vs_inc

71 Usability researchers report varying maximum acceptable response times; context matters. For example, see https://www.nngroup.com/articles/powers-of-10-time-scales-in-ux/ for a discussion of standards in customer-facing applications. For internal applications, standards are lower. See https://www.microstrategy.com/it/press-releases/microstrategy-introduces-new-high-performance-standards-for-business-intelligence .
