Summary

Spark Streaming is based on a micro-batching model suited to applications that need high throughput and can tolerate latencies above roughly 0.5 seconds. Spark Streaming's DStream API provides transformations and actions for working with DStreams, including conventional transformations, window operations, output actions, and stateful operations such as updateStateByKey. Spark Streaming supports a variety of input sources and output sinks used in the Big Data ecosystem. With Kafka, Spark Streaming supports the direct approach, which provides benefits such as exactly-once processing and avoids the replication overhead of a write-ahead log (WAL).
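The stateful processing that updateStateByKey performs over successive micro-batches can be illustrated with a minimal plain-Python sketch; the function and variable names here are illustrative and are not Spark APIs.

```python
# Hypothetical sketch of updateStateByKey-style stateful processing
# over micro-batches: a per-key running count maintained across batches.

def update_state(new_values, running_count):
    """Merge this batch's values for a key into the running state."""
    return (running_count or 0) + sum(new_values)

def process_micro_batch(batch, state):
    """Group the batch by key, then apply the update function per key."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    for key, values in grouped.items():
        state[key] = update_state(values, state.get(key))
    return state

state = {}
for batch in [[("a", 1), ("b", 1)], [("a", 1), ("a", 1)]]:
    state = process_micro_batch(batch, state)
print(state)  # {'a': 3, 'b': 1}
```

In Spark, the state itself would be a distributed, checkpointed RDD rather than a local dictionary, but the update function plays the same role.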

There are two types of failures in a Spark Streaming application: executor failures and driver failures. Executor failures are handled automatically by the Spark Streaming framework, but to handle driver failures, checkpointing and the WAL must be enabled along with a high-availability option for the driver, such as the --supervise flag.
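The idea behind the write-ahead log is that received data is durably persisted before it is considered delivered, so a restarted driver can replay it. This is a minimal sketch of that idea in plain Python, not Spark's actual implementation; the file layout and function names are assumptions.

```python
import os
import tempfile

# Hypothetical sketch of the WAL idea behind driver failure recovery:
# persist each record durably before processing, replay after a restart.

def append_to_wal(wal_path, record):
    """Durably log a record before it is considered received."""
    with open(wal_path, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())

def replay_wal(wal_path):
    """After a restart, recover logged records that may need reprocessing."""
    if not os.path.exists(wal_path):
        return []
    with open(wal_path) as wal:
        return [line.rstrip("\n") for line in wal]

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
for record in ["event-1", "event-2"]:
    append_to_wal(wal_path, record)
print(replay_wal(wal_path))  # ['event-1', 'event-2']
```

Spark writes its WAL to a fault-tolerant store such as HDFS for the same reason the sketch calls fsync: the log must survive the failure it is meant to recover from.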

Structured Streaming represents a paradigm shift in stream computing: it enables building continuous applications with end-to-end exactly-once guarantees and data consistency, even in the case of node delays and failures. Structured Streaming was introduced in Spark 2.0 as an alpha release and is expected to become stable in upcoming versions.
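End-to-end exactly-once delivery of the kind described above is generally built from replayable input plus an idempotent, batch-id-aware sink: a replayed batch is recognized and skipped. The sketch below shows that general mechanism in plain Python; it is an assumption-laden illustration of the principle, not Structured Streaming's engine code.

```python
# Hypothetical sketch of exactly-once output: each micro-batch carries an id,
# and the sink commits the id together with the data, so replays are no-ops.

def write_exactly_once(sink, committed, batch_id, rows):
    """Write a batch unless its id was already committed."""
    if batch_id in committed:
        return False          # duplicate delivery after a failure/retry
    sink.extend(rows)
    committed.add(batch_id)   # commit the id atomically with the data
    return True

sink, committed = [], set()
write_exactly_once(sink, committed, 0, ["a", "b"])
write_exactly_once(sink, committed, 0, ["a", "b"])  # replayed batch skipped
write_exactly_once(sink, committed, 1, ["c"])
print(sink)  # ['a', 'b', 'c']
```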

The next chapter introduces an interesting topic: notebooks and data flows with Spark and Hadoop.
