Stream processing

There are a few extensions of the EDA paradigm for real-time processing. To evaluate and differentiate between them, let's consider a sample use case:

On our travel website, we want to know the number of visitors in the last 30 minutes and get a notification when this number crosses a threshold (say, 100,000 visitors). Let's assume that each visitor action causes an event.

The first (and most straightforward) application of the Distributed Computation model is to have Hadoop do the counts. However, the job needs to finish and restart within 30 minutes. With the Map-Reduce paradigm, such real-time requirements lead to brittle systems: they break if the requirements change slightly, if parallel jobs start executing, or when the size of the data grows exponentially. You can use Apache Spark to improve performance through in-memory computation.

However, these approaches don't scale to this requirement, as their main focus is batching (and throughput optimization) rather than the low-latency computation needed to generate timely notifications in the preceding use case. Essentially, we want to process data as soon as it comes in, rather than batching events. This model is called stream processing.
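To make this concrete, here is a minimal Go sketch of per-event processing for the visitor-count use case: events arrive on a channel and a sliding 30-minute count is updated incrementally, rather than recomputed by a batch job. The event type, the channel-based source, and the timestamp-list window are illustrative assumptions; the sketch counts visitor actions rather than deduplicating unique visitors:

```go
package main

import (
	"fmt"
	"time"
)

// VisitorEvent is a hypothetical event emitted for every visitor action.
type VisitorEvent struct {
	VisitorID string
	At        time.Time
}

// countInWindow processes each event as it arrives, maintaining a
// sliding-window count instead of recomputing it in a batch job.
func countInWindow(events <-chan VisitorEvent, window time.Duration, threshold int) {
	var inWindow []time.Time // timestamps of events still inside the window
	for ev := range events {
		inWindow = append(inWindow, ev.At)
		// Evict timestamps that have fallen out of the window.
		cutoff := ev.At.Add(-window)
		for len(inWindow) > 0 && inWindow[0].Before(cutoff) {
			inWindow = inWindow[1:]
		}
		if len(inWindow) >= threshold {
			fmt.Printf("ALERT: %d visitor actions in the last %v\n", len(inWindow), window)
		}
	}
}

func main() {
	events := make(chan VisitorEvent)
	done := make(chan struct{})
	go func() {
		countInWindow(events, 30*time.Minute, 100000)
		close(done)
	}()
	// In a real system, events would arrive from a broker such as Kafka.
	events <- VisitorEvent{VisitorID: "v1", At: time.Now()}
	close(events)
	<-done
}
```

Each event triggers a small, constant amount of work, which is what keeps the notification latency low regardless of overall data volume.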

The idea here is to create a graph of operators and then inject events into this processing graph. Processing can involve things such as event enrichment, group-by operations, custom processing, or event combination. Apache Storm is an example of a stream processing framework. A Storm program is defined in terms of two abstractions: spouts and bolts. A spout is a stream source (say, a Kafka topic). A bolt is a processing element (logic written by the programmer) that consumes one or more streams, from spouts or from other bolts. A Storm program is essentially a graph of spouts and bolts, and is called a Topology.
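While Storm topologies are typically written in Java, the spout/bolt idea can be sketched in Go using goroutines as operators and channels as the edges of the graph. This is only an illustration of the topology concept, not Storm's actual API; the event format and the stages are made up:

```go
package main

import (
	"fmt"
	"strings"
)

// spout emits raw visitor events; in Storm, this role could be played
// by a Kafka topic.
func spout(out chan<- string) {
	for _, e := range []string{"search:goa", "book:goa", "search:paris"} {
		out <- e
	}
	close(out)
}

// enrichBolt tags each event with its action type before passing it on.
func enrichBolt(in <-chan string, out chan<- string) {
	for e := range in {
		action := strings.SplitN(e, ":", 2)[0]
		out <- action + "|" + e
	}
	close(out)
}

// countBolt maintains per-action counts: the group-by step of the graph.
func countBolt(in <-chan string) {
	counts := map[string]int{}
	for e := range in {
		action := strings.SplitN(e, "|", 2)[0]
		counts[action]++
	}
	fmt.Println(counts) // map[book:1 search:2]
}

func main() {
	// Wire the topology: spout -> enrichBolt -> countBolt.
	raw, enriched := make(chan string), make(chan string)
	go spout(raw)
	go enrichBolt(raw, enriched)
	countBolt(enriched) // runs in main so the program waits for the graph to drain
}
```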

Kafka Streams is another framework; it is built into recent Kafka versions (Kafka is discussed in detail in Chapter 6, Messaging) and allows per-event computation. It uses the Kafka partitioning model to partition data for processing, exposed through an easy-to-use programming library. This library creates a set of stream tasks and assigns Kafka partitions to them for processing. Sadly, at the time of writing, this is not available for Go programs yet.
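Although Kafka Streams itself has no Go library, a rough approximation of its partition-parallel, per-event model can be built in Go with a Kafka client such as sarama. In this sketch (which assumes the github.com/Shopify/sarama package, and where the broker address, group ID, and topic name are invented), Kafka's consumer-group protocol assigns partitions to running instances, and each message is processed as it arrives:

```go
package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
)

// counter handles the partitions assigned to this instance by the
// consumer-group protocol, processing each message as it arrives.
type counter struct{}

func (counter) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (counter) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (counter) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		// Per-event processing happens here (for example, updating a window count).
		log.Printf("partition=%d offset=%d value=%s", msg.Partition, msg.Offset, msg.Value)
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0
	// Broker address, group ID, and topic name are made up for this sketch.
	group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "visitor-counters", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()
	for {
		// Consume blocks until a rebalance; loop to rejoin the group.
		if err := group.Consume(context.Background(), []string{"visitor-events"}, counter{}); err != nil {
			log.Fatal(err)
		}
	}
}
```

Running several instances with the same group ID spreads the topic's partitions across them, which is the same mechanism Kafka Streams uses to scale out stream tasks.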

Apache Samza is another framework for stream processing and uses YARN to spawn processing tasks and Kafka to provide partitioned data for these tasks. Kasper (https://github.com/movio/kasper) is a processing library in Go, and is inspired by Apache Samza. It processes messages in small batches and uses centralized key-value stores, such as Redis, Cassandra, or Elasticsearch, for maintaining state during processing.
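A Kasper-style mini-batch step with externalized state might look like the following sketch, which uses Redis (via the github.com/go-redis/redis client) to hold a running count. This illustrates the pattern rather than Kasper's actual API; the batch source, key name, and threshold handling are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/go-redis/redis/v8"
)

// processBatch folds a small batch of events into a count kept in Redis,
// so state survives task restarts and rebalances. Window expiry is
// omitted to keep the sketch short.
func processBatch(ctx context.Context, rdb *redis.Client, batch []string) error {
	total, err := rdb.IncrBy(ctx, "visitor:count", int64(len(batch))).Result()
	if err != nil {
		return err
	}
	if total >= 100000 {
		fmt.Printf("ALERT: %d visitors\n", total)
	}
	return nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()
	// A real task would pull batches from a partitioned Kafka topic.
	if err := processBatch(ctx, rdb, []string{"e1", "e2", "e3"}); err != nil {
		log.Fatal(err)
	}
}
```

Keeping the state in a central store is what lets any worker pick up a partition after a failure without losing the counts accumulated so far.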

There are also Complex Event Processing (CEP) engines, which allow users to write SQL-like queries on a stream of events (say, a Kafka topic), with a key objective of millisecond response times. This pattern was initially developed for stock-market-related use cases. Though CEP and stream processing frameworks started with different requirements, over time, both technologies have come to share much of the same feature set.

The following diagram summarizes the different stream processing techniques:
