Chapter 5. Real-Time Analytics with Spark Streaming and Structured Streaming

Spark Streaming, an extension of the Spark core, supports real-time processing of fast-moving streaming data so that insights can be extracted and business decisions made in real time or near real time. Spark Streaming is production-ready and is used in many organizations. This chapter helps you get started with writing real-time applications that integrate with Kafka and HBase, and also introduces Structured Streaming, a new concept added in Spark 2.0.

This chapter is divided into the following sub-topics:

  • Introducing real-time processing
  • Architecture of Spark Streaming
  • Stateless and stateful stream processing
  • Transformations and actions
  • Input sources and output stores
  • Spark Streaming with Kafka and HBase
  • Advanced Spark Streaming concepts
  • Monitoring applications
  • Introducing Structured Streaming

Introducing real-time processing

Big Data is generally ingested in real time, and its value must be extracted on arrival to support real-time or near real-time business decisions; for example, fraud detection in a stream of financial transactions must accept or reject each transaction as it occurs.

But what are real-time and near real-time processing? Their meaning varies from business to business, and there is no standard definition. In my view, real time means processing at the speed of the business: for a financial institution doing fraud detection, real time means milliseconds; for a retail company doing clickstream analytics, it means seconds.

There are really only two paradigms for data processing: batch and real time. Batch applications are fundamentally high-latency; processing a few terabytes of data all at once will never finish in a second. Real-time processing instead looks at smaller amounts of data as they arrive and performs intensive computations on them as the business requires.

Real-time or near real-time systems allow you to respond to data as it arrives, without necessarily persisting it to a database first. Stream processing is one kind of real-time processing, in which a high-volume stream of events is processed continuously.

Every business wants to process its data in real time to stay competitive, and real-time analytics is becoming the new standard for data processing.

But, unlike batch processing, implementing real-time systems is more complex: it requires a robust platform and framework that provides low-latency responses, high scalability, high reliability, and fault tolerance to keep applications running 24 x 7 x 365. Failures can happen while receiving data from a streaming source, while processing the data, and while pushing the results to a database, so the architecture must be robust enough to handle failures at every level, processing each record exactly once and missing none.
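As a first taste of how Spark Streaming (the focus of this chapter) addresses such failures, the sketch below recovers a streaming application from a checkpoint on restart. The checkpoint path, socket source, and batch interval are illustrative placeholders; in production, the checkpoint directory would live on a reliable store such as HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientApp {
  // Illustrative checkpoint location; use a reliable store such as HDFS in production.
  val checkpointDir = "file:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ResilientApp")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint(checkpointDir) // persist DStream metadata (and state) for recovery

    // Illustrative pipeline; a real application defines its sources and logic here.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from an existing checkpoint if present; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this pattern, a restarted driver rebuilds its DStream lineage and state from the checkpoint rather than starting from scratch, which is one of the building blocks behind the exactly-once guarantees discussed above.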

Spark Streaming provides near real-time processing with latencies of a few hundred milliseconds and efficient fault tolerance. Several other frameworks, such as Storm, Trident, Samza, and Apache Flink, are also available for real-time stream processing. The following table summarizes the differences between these frameworks:

 

|                          | Spark Streaming          | Storm                              | Trident             | Samza             | Apache Flink                    |
|--------------------------|--------------------------|------------------------------------|---------------------|-------------------|---------------------------------|
| Architecture             | Micro-batch              | Record-at-a-time                   | Micro-batch         | Record-at-a-time  | Continuous flow, operator-based |
| Language APIs            | Java, Scala, Python      | Java, Scala, Python, Ruby, Clojure | Java, Clojure, Scala| Java, Scala       | Java, Scala                     |
| Resource Managers        | YARN, Mesos, Standalone  | YARN, Mesos                        | YARN, Mesos         | YARN              | Standalone, YARN                |
| Latency                  | ~0.5 seconds             | ~100 milliseconds                  | ~0.5 seconds        | ~100 milliseconds | ~100 milliseconds               |
| Exactly-once processing  | Yes                      | No                                 | Yes                 | No                | Yes                             |
| Stateful operations      | Yes                      | No                                 | Yes                 | Yes               | Yes                             |
| Sliding window functions | Yes                      | No                                 | Yes                 | No                | Yes                             |
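To make the micro-batch architecture in the table concrete, here is a minimal Spark Streaming word count in Scala; the socket source on localhost:9999, the 2-second batch interval, and the 10-second window are illustrative choices rather than requirements.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Each 2-second micro-batch becomes a small RDD that is processed as a unit.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Illustrative source: text lines from a local socket (e.g. fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Per-batch counts, plus a sliding-window variant: counts over the
    // last 10 seconds, recomputed every 2 seconds.
    pairs.reduceByKey(_ + _).print()
    pairs.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(2)).print()

    ssc.start()            // begin receiving and processing micro-batches
    ssc.awaitTermination() // block until the computation stops or fails
  }
}
```

Because the window is 10 seconds wide and slides every 2 seconds, each windowed result aggregates the five most recent micro-batches; this is the sliding-window capability noted in the last row of the table.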

Pros and cons of Spark Streaming

Let's take a look at some of the pros and cons of Spark Streaming.

Pros:

  • Provides a simple programming model using Java, Scala, and Python APIs
  • High throughput with fast fault recovery
  • Integrates batch and streaming processing
  • Easy to implement Spark SQL and Machine Learning Library (MLlib) operations within Spark Streaming (a sketch of the SQL integration follows the cons below)

Cons:

  • While micro-batch processing provides many benefits, its latency is higher than that of record-at-a-time systems.
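To illustrate the Spark SQL integration listed under the pros, the following sketch runs a SQL aggregation over each micro-batch. It is a minimal example assuming Spark 2.0 or later (for SparkSession); the socket source and the words view and word column names are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SqlOnStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SqlOnStream")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Illustrative source: words arriving on a local socket.
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Each micro-batch is an ordinary RDD, so Spark SQL can query it directly.
    words.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("word").createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```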

History of Spark Streaming

Spark Streaming was introduced to the Apache Spark project in its 0.7 release in February 2013, and it graduated from alpha to a stable release with version 0.9 in February 2014. The Python API was added in Spark 1.2, with some limitations. Subsequent releases introduced many new features and performance improvements, such as the Streaming UI, write-ahead logs (WAL), exactly-once semantics with direct Kafka support, and backpressure support.
