Chapter 5. Real-Time Analytics with Spark Streaming and Structured Streaming

Spark Streaming, an extension of the Spark core, supports real-time processing of fast-moving streaming data so that insights can be extracted and business decisions made in real time or near real time. Spark Streaming is production-ready and is used in many organizations. This chapter helps you get started with writing real-time applications that integrate with Kafka and HBase, and also introduces Structured Streaming, a new concept added in Spark 2.0.

This chapter is divided into the following sub-topics:

  • Introducing real-time processing
  • Architecture of Spark Streaming
  • Stateless and stateful stream processing
  • Transformations and actions
  • Input sources and output stores
  • Spark Streaming with Kafka and HBase
  • Advanced Spark Streaming concepts
  • Monitoring applications
  • Introducing Structured Streaming

Introducing real-time processing

Big Data is generally ingested in real time, and its value must be extracted on arrival to support real-time or near real-time business decisions; for example, fraud detection in a stream of financial transactions must accept or reject each transaction as it occurs.

But what are real-time and near real-time processing? Their meaning varies from business to business, and there is no standard definition. In my view, real time means processing at the speed of the business: for a financial institution doing fraud detection, real time means milliseconds; for a retail company doing clickstream analytics, it means seconds.

There are really only two paradigms for data processing: batch and real time. Batch applications are fundamentally high-latency; processing a few terabytes of data all at once will never finish in a second. Real-time processing instead looks at smaller amounts of data as they arrive and performs intensive computations on them as the business requires.

Real-time or near real-time systems allow you to respond to data as it arrives, without necessarily persisting it to a database first. Stream processing is one kind of real-time processing, in which a high-volume stream of events is processed continuously.

Every business wants to process its data in real time to stay competitive, and real-time analytics is becoming the new standard for data processing.

But, unlike batch processing, implementing real-time systems is more complex: it requires a robust platform and framework that provides low-latency responses, high scalability, high reliability, and fault tolerance to keep applications running 24 x 7 x 365. Failures can happen while receiving data from a streaming source, while processing the data, and while pushing the results to a database, so the architecture must be robust enough to handle failures at every level, processing each record exactly once and missing none.
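As a first taste of how Spark Streaming (the focus of this chapter) addresses such failures, the sketch below recovers a streaming application from a checkpoint on restart. The checkpoint path, socket source, and batch interval are illustrative placeholders; in production, the checkpoint directory would live on a reliable store such as HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientApp {
  // Illustrative checkpoint location; use a reliable store such as HDFS in production.
  val checkpointDir = "file:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ResilientApp")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint(checkpointDir) // persist DStream metadata (and state) for recovery

    // Illustrative pipeline; a real application defines its sources and logic here.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from an existing checkpoint if present; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this pattern, a restarted driver rebuilds its DStream lineage and state from the checkpoint rather than starting from scratch, which is one of the building blocks behind the exactly-once guarantees discussed above.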

Spark Streaming provides near real-time processing with latencies of a few hundred milliseconds and efficient fault tolerance. Several other frameworks, such as Storm, Trident, Samza, and Apache Flink, are also available for real-time stream processing. The following table summarizes the differences between these frameworks:

 

|                          | Spark Streaming          | Storm                              | Trident             | Samza             | Apache Flink                    |
|--------------------------|--------------------------|------------------------------------|---------------------|-------------------|---------------------------------|
| Architecture             | Micro-batch              | Record-at-a-time                   | Micro-batch         | Record-at-a-time  | Continuous flow, operator-based |
| Language APIs            | Java, Scala, Python      | Java, Scala, Python, Ruby, Clojure | Java, Clojure, Scala| Java, Scala       | Java, Scala                     |
| Resource Managers        | YARN, Mesos, Standalone  | YARN, Mesos                        | YARN, Mesos         | YARN              | Standalone, YARN                |
| Latency                  | ~0.5 seconds             | ~100 milliseconds                  | ~0.5 seconds        | ~100 milliseconds | ~100 milliseconds               |
| Exactly-once processing  | Yes                      | No                                 | Yes                 | No                | Yes                             |
| Stateful operations      | Yes                      | No                                 | Yes                 | Yes               | Yes                             |
| Sliding window functions | Yes                      | No                                 | Yes                 | No                | Yes                             |
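To make the micro-batch architecture in the table concrete, here is a minimal Spark Streaming word count in Scala; the socket source on localhost:9999, the 2-second batch interval, and the 10-second window are illustrative choices rather than requirements.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Each 2-second micro-batch becomes a small RDD that is processed as a unit.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Illustrative source: text lines from a local socket (e.g. fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Per-batch counts, plus a sliding-window variant: counts over the
    // last 10 seconds, recomputed every 2 seconds.
    pairs.reduceByKey(_ + _).print()
    pairs.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(2)).print()

    ssc.start()            // begin receiving and processing micro-batches
    ssc.awaitTermination() // block until the computation stops or fails
  }
}
```

Because the window is 10 seconds wide and slides every 2 seconds, each windowed result aggregates the five most recent micro-batches; this is the sliding-window capability noted in the last row of the table.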

Pros and cons of Spark Streaming

Let's take a look at some of the pros and cons of Spark Streaming.

Pros:

  • Provides a simple programming model using Java, Scala, and Python APIs
  • High throughput with fast fault recovery
  • Integrates batch and streaming processing
  • Easy to implement Spark SQL and Machine Learning Library (MLlib) operations within Spark Streaming (a sketch of the SQL integration follows the cons below)

Cons:

  • While micro-batch processing provides many benefits, its latency is higher than that of record-at-a-time systems.
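To illustrate the Spark SQL integration listed under the pros, the following sketch runs a SQL aggregation over each micro-batch. It is a minimal example assuming Spark 2.0 or later (for SparkSession); the socket source and the words view and word column names are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SqlOnStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SqlOnStream")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Illustrative source: words arriving on a local socket.
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Each micro-batch is an ordinary RDD, so Spark SQL can query it directly.
    words.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("word").createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```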

History of Spark Streaming

Spark Streaming was introduced to the Apache Spark project in its 0.7 release in February 2013, and it graduated from alpha to a stable release with version 0.9 in February 2014. The Python API was added in Spark 1.2, with some limitations. Subsequent releases introduced many new features and performance improvements, such as the Streaming UI, write-ahead logs (WAL), exactly-once semantics with direct Kafka support, and backpressure support.
