Spark Streaming supports real-time processing of fast moving, streaming data to gain insights for business and make business decisions in real-time or near real-time. It is an extension to Spark core to support stream processing. Spark Streaming is production-ready and is used in many organizations. This chapter helps you get started with writing real-time applications including Kafka and HBase. This chapter also helps you to get started with the new concept of Structured Streaming introduced in Spark 2.0.
This chapter is divided into the following sub-topics:
Big Data is generally ingested in real-time and the value of Big Data must be extracted on its arrival to make business decisions in real-time or near real-time, for example, fraud detection in financial transaction streams to accept or reject a transaction.
But, what is real-time and near real-time processing? The meaning of real-time or near real-time can vary from business to business and there is no standard definition for this. According to me, real-time means processing at the speed of a business. For a financial institution doing fraud detection, real-time means milliseconds for them. For a retail company doing click-stream analytics, real-time means seconds.
There are really only two paradigms for data processing: batch and real-time. Batch processing applications fundamentally provide high-latency, while real-time applications provide low latency. So, processing a few terabytes of data all at once will not be finished in a second. Real-time processing looks at smaller amounts of data as they arrive, to do intense computations as per the business requirements.
Real-time or near real-time systems allow you to respond to data as it arrives without necessarily persisting it to a database first. Stream processing is one kind of real-time processing that essentially processes a stream of events at high volume.
Every business wants to process the data in real-time to sustain their business. Real-time analytics are becoming a new standard way of data processing.
But, unlike batch processing, implementing real-time systems is more complex and requires a robust platform and framework for low latency response, high scalability, high reliability, and fault tolerance to keep applications running 24 x 7 x 365. Failures can happen while receiving data from a streaming source, processing the data, and pushing the results to a database. So, the architecture should be robust enough to take care of failures at all levels, process a record exactly once, and not miss any records.
Spark Streaming provides near real-time processing responses with an approximate few hundred milliseconds latency and efficient fault tolerance. Many other frameworks such as Storm, Trident, Samza, and Apache Flink are available for real-time stream processing. The following table shows the differences between these frameworks:
Spark Streaming |
Storm |
Trident |
Samza |
Apache Flink | |
---|---|---|---|---|---|
Architecture |
Micro-batch |
Record-at-a-time |
Micro-batch |
Record-at-a-time |
Continuous flow, operator-based |
Language APIs |
Java, Scala, Python |
Java, Scala, Python, Ruby, Clojure |
Java, Clojure, Scala |
Java, Scala |
Java, Scala |
Resource Managers |
Yarn, Mesos, Standalone |
Yarn and Mesos |
Yarn and Mesos |
Yarn |
Standalone, Yarn |
Latency |
~0.5 seconds |
~100 Milliseconds |
~0.5 seconds |
~100 Milliseconds |
~100 Milliseconds |
Exactly once processing |
Yes |
No |
Yes |
No |
Yes |
Stateful Operations |
Yes |
No |
Yes |
Yes |
Yes |
Functions of sliding windows |
Yes |
No |
Yes |
No |
Yes |
Let's take a look at some of the pros and cons of Spark Streaming.
Pros:
Cons:
Spark Streaming was introduced to the Apache Spark project with its 0.7 release in February 2013. From the alpha testing phase, a stable version, 0.9, was released in February 2014. The Python API was added in Spark 1.2 with some limitations. All new releases introduced many new features and performance improvements such as Streaming UI, write-ahead logs (WAL), exactly once write ahead log, direct Kafka support, and back pressure support.