Chapter 10. Structured Streaming

This chapter will provide a jump-start on the concepts behind Spark Streaming and how it has evolved into Structured Streaming. An important aspect of Structured Streaming is that it utilizes Spark DataFrames. This paradigm shift will make it easier for Python developers to start working with Spark Streaming.

In this chapter, you will learn about:

  • What is Spark Streaming?
  • Why do we need Spark Streaming?
  • What is the Spark Streaming application data flow?
  • Simple streaming application using DStream
  • A quick primer on Spark Streaming global aggregations
  • Introducing Structured Streaming

Note that for the initial sections of this chapter, the example code will be in Scala, as this is how most Spark Streaming code is written. When we start focusing on Structured Streaming, we will work with Python examples.

What is Spark Streaming?

At its core, Spark Streaming is a scalable, fault-tolerant streaming system that takes the RDD batch paradigm (that is, processing data in batches) and speeds it up. While it is a slight oversimplification, Spark Streaming essentially operates on mini-batches, processing data over a batch interval (which can range from 500 ms up to larger window sizes).

As noted in the following diagram, Spark Streaming receives an input data stream and internally breaks that stream into multiple smaller batches (whose size is determined by the batch interval). The Spark engine then processes each batch of input data and produces a corresponding batch of processed results.

(Figure: an input data stream broken into batches of input data, which the Spark engine processes into batches of processed data)

Source: Apache Spark Streaming Programming Guide at: http://spark.apache.org/docs/latest/streaming-programming-guide.html

The key abstraction for Spark Streaming is the Discretized Stream (DStream), which represents the previously mentioned small batches that make up the stream of data. DStreams are built on RDDs, allowing Spark developers to work within the same context of RDDs and batches, only now applying it to streaming problems. Just as importantly, because you are using Apache Spark, Spark Streaming integrates with MLlib, Spark SQL, DataFrames, and GraphX.
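To make this concrete, the following is a minimal Scala sketch of a streaming word count, along the lines of the example in the Spark Streaming Programming Guide. The host and port of the socket source are placeholders; locally, you could feed it with nc -lk 9999:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("StreamingWordCount")
          .setMaster("local[2]")

        // A one-second batch interval: the stream is discretized into
        // RDDs, each holding one second's worth of data
        val ssc = new StreamingContext(conf, Seconds(1))

        // DStream of text lines read from a socket source
        val lines = ssc.socketTextStream("localhost", 9999)

        // Familiar RDD-style transformations, applied to every mini-batch
        val wordCounts = lines.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()             // start receiving and processing data
        ssc.awaitTermination()  // run until explicitly stopped
      }
    }

Each transformation runs against the RDD underlying every batch, which is why existing RDD knowledge carries over so directly.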

The following figure denotes the basic components of Spark Streaming:

(Figure: the basic components of Spark Streaming)

Source: Apache Spark Streaming Programming Guide at: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is a high-level API that provides fault-tolerant exactly-once semantics for stateful operations. Spark Streaming has built-in receivers that can ingest data from many sources, the most common being Apache Kafka, Flume, HDFS/S3, Kinesis, and Twitter. For example, the most commonly used integration, between Kafka and Spark Streaming, is well documented in the Spark Streaming + Kafka Integration Guide found at: https://spark.apache.org/docs/latest/streaming-kafka-integration.html.
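As a sketch of what the Kafka integration looks like in Scala, the following is adapted from that integration guide. It assumes an existing StreamingContext named ssc, the spark-streaming-kafka-0-10 package on the classpath, and a hypothetical topic named events on a broker at localhost:9092:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Kafka consumer configuration; broker address and group id are placeholders
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-example"
    )

    // Each Kafka record arrives as part of a DStream of consumer records
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.map(record => record.value).print()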

Also, you can create your own custom receiver, such as the Meetup Receiver (https://github.com/actions/meetup-stream/blob/master/src/main/scala/receiver/MeetupReceiver.scala), which allows you to read the Meetup Streaming API (https://www.meetup.com/meetup_api/docs/stream/2/rsvps/) using Spark Streaming.
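To illustrate the pattern such a receiver follows, here is a skeletal custom receiver in Scala. It is a simplified sketch for illustration; the actual MeetupReceiver handles the HTTP connection and streaming JSON more carefully:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // A minimal line-oriented receiver that reads from a URL
    class SimpleUrlReceiver(url: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Read on a separate thread so onStart() can return immediately
        new Thread("SimpleUrlReceiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // Nothing to clean up here; the receive loop checks isStopped()
      }

      private def receive(): Unit = {
        val source = scala.io.Source.fromURL(url)
        try {
          for (line <- source.getLines() if !isStopped()) {
            store(line)  // hand each record over to Spark Streaming
          }
        } finally {
          source.close()
        }
      }
    }

You would then create a DStream from it with ssc.receiverStream(new SimpleUrlReceiver("https://stream.meetup.com/2/rsvps")), where the URL is the RSVP endpoint described in the Meetup Streaming API documentation.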

Note

Watch the Meetup Receiver in Action

If you are interested in seeing the Spark Streaming Meetup Receiver in action, you can refer to the Databricks notebooks at: https://github.com/dennyglee/databricks/tree/master/notebooks/Users/denny%40databricks.com/content/Streaming%20Meetup%20RSVPs, which utilize the previously mentioned Meetup Receiver.

The following is a screenshot of the notebook in action in the left window, with the Spark UI (Streaming tab) shown in the right window.

(Screenshots: the notebook running on the left and the Spark UI Streaming tab on the right)

You will be able to use Spark Streaming to receive Meetup RSVPs from around the country (or world) and get a near real-time summary of those RSVPs by state (or country). Note that these notebooks are currently written in Scala.
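As a rough Scala sketch of that per-state summary, the following reuses the SimpleUrlReceiver from earlier and counts RSVPs by state within each batch. The group_state field name and the regex extraction are assumptions about the RSVP JSON payload (the notebooks use a proper JSON parser), and a running total across batches would require a stateful operation such as updateStateByKey, covered in the global aggregations primer:

    // Naive extraction of the group's two-letter state code from the RSVP
    // JSON; the "group_state" field name is an assumption about the payload
    val StatePattern = """"group_state"\s*:\s*"(\w\w)"""".r
    def extractState(rsvp: String): String =
      StatePattern.findFirstMatchIn(rsvp).map(_.group(1)).getOrElse("unknown")

    val rsvps = ssc.receiverStream(
      new SimpleUrlReceiver("https://stream.meetup.com/2/rsvps"))

    // Per-batch RSVP counts keyed by state
    val rsvpsByState = rsvps.map(r => (extractState(r), 1)).reduceByKey(_ + _)
    rsvpsByState.print()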
