Spark Streaming

Spark Streaming wasn't the first streaming architecture. Over time, multiple technologies have been developed to address various real-time processing needs. One of the first popular stream processors was Twitter Storm, which was adopted by many businesses. Spark includes its own streaming library, which has grown to become the most widely used streaming technology today. This is mainly because Spark Streaming holds significant advantages over the other technologies, the most important being the integration of the Spark Streaming APIs with Spark's core API. Spark Streaming is also integrated with Spark ML, Spark SQL, and GraphX. All of these integrations make Spark a powerful and versatile streaming technology.

Note that https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html has more information on Spark Streaming. Flink, Heron (Twitter Storm's successor), and Samza offer their own features; for example, they can handle events one at a time while minimizing latency. Spark Streaming, by contrast, consumes data and processes it in micro-batches, and the minimum size of these micro-batches is 500 milliseconds.

Apache Apex, Flink, Samza, Heron, Gearpump, and other newer technologies all compete with Spark Streaming in some cases. Spark Streaming will not be the right fit if you need true event-by-event processing.

Spark Streaming works by creating batches of events at a time interval configured by the user and delivering them for processing at another specified time interval.

Spark Streaming supports several input sources, such as Kafka, Flume, Kinesis, TCP sockets, and files, and can write results to several sinks, such as file systems, databases, and live dashboards.

Similar to the SparkContext, Spark Streaming has a StreamingContext, the primary point of entry for streaming to take place. The StreamingContext depends on the SparkContext, and the SparkContext can actually be used directly in the streaming task. The StreamingContext is similar to the SparkContext, the difference being that the StreamingContext requires the program to specify a batch interval/duration, which can range from milliseconds to minutes.
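As a minimal sketch (the application name, master URL, socket source, and one-second batch interval below are illustrative choices, not requirements), a StreamingContext is created from a SparkConf and a batch duration, after which an input source and an output sink are attached to it:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative configuration; the app name and master URL are placeholders
val conf = new SparkConf()
  .setAppName("StreamingSketch")
  .setMaster("local[2]")

// The batch interval (here, 1 second) controls how often a micro-batch is produced
val ssc = new StreamingContext(conf, Seconds(1))

// An example input source: lines of text read from a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// A simple transformation and an example sink: printing each batch's word counts to the console
val words = lines.flatMap(_.split(" "))
words.countByValue().print()

// Start processing and block until the job is stopped
ssc.start()
ssc.awaitTermination()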

The SparkContext is the main point of entry. The StreamingContext reuses the logic that is part of the SparkContext (task scheduling and resource management).
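To make this relationship concrete, here is a short sketch (the one-second batch interval and the broadcast variable are purely illustrative) showing that a StreamingContext can also be built on top of an existing SparkContext, which remains accessible from within the streaming job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// An existing SparkContext provides task scheduling and resource management
val sc = new SparkContext(
  new SparkConf().setAppName("StreamingSketch").setMaster("local[2]"))

// The StreamingContext is layered on top of it, adding only the batch interval
val ssc = new StreamingContext(sc, Seconds(1))

// The underlying SparkContext is still available, for example to broadcast reference data
val stopWords = ssc.sparkContext.broadcast(Set("a", "an", "the"))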