Spark Streaming supports three kinds of input sources: basic sources, which are available directly on the StreamingContext API; advanced sources such as Kafka, Flume, and Kinesis, which are available through extra utility classes in separate library artifacts; and custom sources, which you implement yourself.
Multiple receivers can be created in the same application to receive data from different sources. It is important to allocate enough resources (cores and memory) so that the receivers and the processing tasks can execute simultaneously. For example, if you start your application with a single core, it will be taken by the receiver and no tasks will be executed because no cores are left.
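As a minimal sketch of this resource requirement, the following snippet runs with two local threads so that one can be dedicated to the receiver; the host, port, and application name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" gives one thread for the receiver and at least one for task
// execution; "local[1]" would starve the processing tasks entirely.
val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverDemo")
val ssc = new StreamingContext(conf, Seconds(10))

// One receiver is created for this socket stream and permanently occupies
// one of the two threads.
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()
```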
There are four basic sources available on the StreamingContext API: socket streams (socketTextStream), raw socket streams (rawSocketStream), file streams (fileStream and textFileStream), and RDD queue streams (queueStream).
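The following sketch shows how each of these basic sources might be created. It assumes the StreamingContext `ssc` from the previous snippet; the hosts, ports, and paths are placeholders:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// 1. Socket source: lines of UTF-8 text read from a TCP socket.
val socketStream = ssc.socketTextStream("localhost", 9999)

// 2. Raw socket source: receives serialized data blocks directly,
// avoiding deserialization overhead on the receiver.
val rawStream = ssc.rawSocketStream[String](
  "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

// 3. File source: picks up new files dropped into a monitored directory.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/streaming/input")

// 4. Queue source: a queue of RDDs, mostly useful for testing.
val rddQueue = new mutable.Queue[RDD[String]]()
val queueStream = ssc.queueStream(rddQueue)
```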
Advanced sources such as Kafka, Flume, and Kinesis are not available on the StreamingContext API directly; their DStreams are created through utility classes in separate library artifacts that you link against. Now let's take a look at custom sources:
Input DStreams can also be created out of your own custom data sources by extending the Receiver API: implement a user-defined receiver that can receive data from the new custom source, and register it with StreamingContext.receiverStream(). The Python API does not support this functionality yet.
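A minimal custom receiver might look like the following sketch, which follows the pattern shown in the Spark documentation and reads lines of text from a socket; the host and port are placeholders:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A user-defined receiver that reads lines of text from a socket.
class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that receives the data; onStart() must not block.
    new Thread("Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up: receive() exits once isStopped returns true,
    // so the receiving thread ends on its own.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // hand the record to Spark for storage
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask Spark to restart the receiver and reconnect.
      restart("Trying to connect again")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}

// Plug the custom receiver into the StreamingContext:
val customStream = ssc.receiverStream(new CustomReceiver("localhost", 9999))
```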
There are two types of receivers in Spark Streaming based on reliability: reliable receivers, which send an acknowledgment to the source once the received data has been stored and replicated in Spark, and unreliable receivers, which do not send any acknowledgment.
So, depending on the type of receiver, data can be received at-least-once or exactly-once. For example, while the receiver-based Kafka API provides at-least-once semantics, the Kafka direct API provides exactly-once semantics. To make sure that received blocks are not lost in case of driver failures, enabling the write-ahead log (WAL) helps: received data is persisted to a fault-tolerant file system before it is processed.
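A sketch of enabling the WAL through configuration follows; the checkpoint path is a placeholder, and a fault-tolerant file system such as HDFS is assumed:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALDemo")
  // Write received blocks to a write-ahead log before processing them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The WAL lives in the checkpoint directory, so checkpointing on a
// fault-tolerant file system is required.
ssc.checkpoint("hdfs://namenode:8020/user/streaming/checkpoint")
```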
Once the data is processed in the Spark Streaming application, it can be written to a variety of sinks such as HDFS, relational databases, HBase, Cassandra, Kafka, or Elasticsearch. All output operations are executed one at a time, in the same order they are defined in the application. Also, in some cases the same record can be processed more than once, which produces duplicates in the output stores.
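As an illustration of the usual output pattern, the following sketch writes each partition through a single connection; createConnection() and send() are hypothetical stand-ins for whatever client library the sink provides:

```scala
// `lines` is assumed to be an existing DStream[String].
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Open one connection per partition on the executor side, so the
    // (typically non-serializable) connection object never has to be
    // shipped from the driver.
    val connection = createConnection() // hypothetical helper
    partition.foreach(record => connection.send(record)) // hypothetical send
    connection.close()
  }
}
```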
It is important to understand the at-most-once, at-least-once, and exactly-once guarantees offered by Spark Streaming. For example, the Kafka direct API provides exactly-once receiving semantics, so a record is received and processed exactly once; whether it also reaches the output store exactly once depends on the output operation. Note that once a record has been received, transformations always process it exactly once because of the guarantees RDDs provide, so it is the receiving step, and consequently the output, that is affected by the choice of receiver. For NoSQL databases such as HBase, writing the same record twice will simply update it with a new version. To keep such updates idempotent, the timestamp can be passed in from the source system instead of letting HBase assign one.
So, if the same record is processed and sent to HBase multiple times, it just replaces the existing record because the timestamp is identical. This concept is called idempotent updates. For RDBMS databases, however, writing the same record twice may either throw a constraint violation or insert a duplicate row. The other way to achieve exactly-once output is the transactional updates approach: the results are saved atomically together with a unique identifier, typically derived from the batch time and the partition index, so that a replayed batch can detect that its output was already committed and skip it, as shown in the sketch below.
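A sketch of the transactional updates pattern follows, again assuming `lines` is an existing DStream; commitTransactionally() is a hypothetical helper that writes a partition's records together with its unique identifier in one transaction and skips identifiers that have already been committed:

```scala
import org.apache.spark.TaskContext

lines.foreachRDD { (rdd, batchTime) =>
  rdd.foreachPartition { partition =>
    // Batch time plus partition index uniquely identifies this slice of
    // output, so a recomputed batch commits its results at most once.
    val uniqueId = s"${batchTime.milliseconds}-${TaskContext.get.partitionId()}"
    commitTransactionally(uniqueId, partition) // hypothetical helper
  }
}
```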