Classic Spark Streaming

Spark Streaming has a context wrapper called StreamingContext, which wraps around SparkContext and is the entry point to Spark Streaming functionality. Streaming data is, by definition, continuous, and it needs to be sliced by time before it can be processed. This slice of time is called the batch interval, and it is specified when the StreamingContext is created. There is a one-to-one mapping between batches and RDDs: each batch results in one RDD. Spark Streaming takes the continuous data, breaks it into batches, and feeds each batch to Spark for processing.
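As a minimal sketch of this setup in Scala, the following snippet creates a StreamingContext with a 2-second batch interval; the application name and master URL are placeholder values for illustration, not prescribed by the text:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder app name and master; local[2] leaves one thread for
    // the receiver and one for processing.
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")

    // The batch interval (2 seconds here) is fixed when the
    // StreamingContext is created.
    val ssc = new StreamingContext(conf, Seconds(2))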

The batch interval is important when optimizing your streaming application. Ideally, you want to process data at least as fast as it is ingested; otherwise, your application will develop a backlog. Spark Streaming collects data for the duration of a batch interval, say, 2 seconds. The moment this 2-second interval is over, the data collected in that interval is handed to Spark for processing, while Spark Streaming moves on to collecting data for the next batch interval. Spark therefore has at most one batch interval, 2 seconds here, to process each batch, so that it is free when the next batch arrives. If Spark can process the data faster, you can reduce the batch interval to, say, 1 second. If Spark is not able to keep up, you will have to increase the batch interval.

A continuous stream of RDDs in Spark Streaming needs to be represented in the form of an abstraction through which it can be processed. This abstraction is called a Discretized Stream (DStream). Any operation applied to a DStream results in an operation on the underlying RDDs.
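To make this concrete, here is a sketch of a word count that continues from the ssc created above; each transformation on the DStream is applied to the RDD of every batch. The host and port are illustrative placeholders:

    // Each transformation below runs against the RDD produced in every
    // 2-second batch; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print() // prints the counts computed for each batch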

Every input DStream (except for the file stream) is associated with a receiver. A receiver receives data from the input source and stores it in Spark's memory. There are two types of streaming sources (a small sketch of the basic sources follows the list):

  • Basic sources, such as file and socket connections
  • Advanced sources, such as Kafka and Flume
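Assuming the ssc from the earlier snippet, and with the port and path as placeholders, the two basic sources look like this:

    // Receiver-based basic source: listens on a TCP socket.
    val socketStream = ssc.socketTextStream("localhost", 9999)

    // File-based basic source: monitors a directory for new files and
    // needs no receiver.
    val fileStream = ssc.textFileStream("hdfs:///tmp/streaming-input")

Advanced sources such as Kafka and Flume are not part of the core streaming API; they require the corresponding external connector library to be added to the application's classpath.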

Spark Streaming also provides windowed computations, in which you apply transformations over a sliding window of data (see the sketch after this list). A sliding window operation is based on two parameters:

  • Window length: This is the duration of the window. For example, if you want to get analytics of the last 1 minute of data, the window length will be 1 minute.
  • Sliding interval: This depicts how frequently you want to perform an operation. Say you want to perform the operation every 10 seconds: a new 1-minute window is then computed every 10 seconds, sharing 50 seconds of data with the previous window and adding 10 seconds of new data.
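A sketch of such a window, reusing the pairs DStream from the word-count snippet above; note that both the window length and the sliding interval must be multiples of the batch interval:

    // Count words over the last 1 minute, recomputed every 10 seconds.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // reduce function applied within the window
      Seconds(60),               // window length
      Seconds(10)                // sliding interval
    )
    windowedCounts.print()

    ssc.start()            // start receiving and processing data
    ssc.awaitTermination() // block until the streaming job is stopped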