Data checkpointing

Data checkpointing saves the generated RDDs to reliable storage such as HDFS. If the streaming application fails, the RDDs can be recovered and processing can continue where it left off. Data checkpointing helps beyond failure recovery as well: when RDDs are lost because of cache cleanup or the loss of an executor, they can be read back from the checkpoint instead of waiting for their parent RDDs in the DAG lineage to be reprocessed.

Checkpointing must be enabled for any application that has either of the following requirements:

  • Stateful transformations are applied. If updateStateByKey() or reduceByKeyAndWindow() (with an inverse function) is used, then a checkpoint directory has to be provided so that RDD checkpointing can take place.
  • The application must recover from driver failures. Metadata checkpoints are used to recover progress information.
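
The two requirements above can be sketched in PySpark roughly as follows. This is a minimal illustration rather than code from the text: the checkpoint path, host, port, and the names update_count, create_context, and main are hypothetical, and the 1-second batch interval and 10-second checkpoint interval are example values only. pyspark is imported inside the functions so that the update logic can be tried without a Spark installation.

```python
CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # hypothetical path


def update_count(new_values, running_count):
    """State update function for updateStateByKey().

    Adds the counts seen in the current batch to the running total;
    running_count is None the first time a key appears.
    """
    return sum(new_values) + (running_count or 0)


def create_context():
    """Build a fresh StreamingContext; used only when no checkpoint exists."""
    # Imported here so update_count() can be tested without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointSketch")
    ssc = StreamingContext(sc, 1)   # 1-second batch interval (example value)
    ssc.checkpoint(CHECKPOINT_DIR)  # needed for the stateful transformation

    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.checkpoint(10)  # checkpoint the state RDDs every 10 seconds
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Rebuild the context from checkpoint data if it exists (driver recovery);
    # otherwise call create_context() to start from scratch.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling main() would start the job. On a restart after a driver failure, getOrCreate() restores the context from the checkpoint directory instead of calling create_context() again, which is why all of the stream setup lives inside that function.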

If there are no stateful transformations, then the application can be run without checkpointing enabled. In that case, recovery from driver failures is only partial: data that has been received but not yet processed may be lost.

Note that RDD checkpointing means saving each RDD to storage, which increases the processing time of the batches whose RDDs are checkpointed. The checkpointing interval must therefore be set and tuned so that it does not hinder performance, which matters given the expectations of real-time processing.

With tiny batch sizes (1 second, for example), checkpointing every batch happens too frequently and can reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and the task sizes to grow, which delays processing as data queues up.

For stateful transformations that need RDD checkpointing, the default checkpointing interval is a multiple of the batch interval that is at least 10 seconds. It can be changed with dstream.checkpoint(checkpointInterval). A good setting to start with is a checkpointing interval of 5 to 10 sliding intervals of the DStream.
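
To make the arithmetic concrete, here is a small plain-Python sketch. It assumes the default is the smallest multiple of the sliding interval that reaches 10 seconds, and it treats the 5-to-10-interval guideline as a simple range; the function names are hypothetical and nothing here calls Spark.

```python
import math


def default_checkpoint_interval(slide_seconds, minimum_seconds=10.0):
    """Smallest multiple of the sliding interval that is >= minimum_seconds."""
    return slide_seconds * math.ceil(minimum_seconds / slide_seconds)


def suggested_range(slide_seconds):
    """Starting range of 5 to 10 sliding intervals, expressed in seconds."""
    return 5 * slide_seconds, 10 * slide_seconds


# Print the assumed default and the suggested starting range for a few
# example sliding intervals (all values in seconds).
for slide in (1, 2, 3, 15):
    print(slide, default_checkpoint_interval(slide), suggested_range(slide))
```

Under these assumptions, a 3-second sliding interval gives a default of 12 seconds and a suggested starting range of 15 to 30 seconds, so the default sits slightly below the suggested range and is worth tuning upward.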
