Late-arriving/out-of-order data

If streaming challenges were ranked, late data would take first place. The problem is so specific to streaming that people unfamiliar with streaming are often surprised by how prevalent it is.

There are two notions of time in streaming:

  • Event time: The time at which an event actually happened, for example, the moment the temperature is measured on a drive to an industrial site. The record almost always carries this time as one of its fields.
  • Processing time: The time measured by the program that processes the event. For example, if a time-series IoT event is processed in the cloud, the processing time is the time at which the event reaches the component doing the processing (such as Kinesis).
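The distinction can be made concrete with a small sketch. The record layout, the field names (sensor_id, temperature_c, event_time), and the 42-second delay below are assumptions chosen for illustration, not taken from any particular system. The sketch shows only that event time travels inside the record, while processing time is stamped by whatever component handles it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Reading:
    """A hypothetical IoT record; event_time is set by the device itself."""
    sensor_id: str
    temperature_c: float
    event_time: datetime


def lag_seconds(reading: Reading, processing_time: datetime) -> float:
    """Return the gap between when the event happened and when it was processed."""
    return (processing_time - reading.event_time).total_seconds()


# Event time: recorded at the source, carried inside the record.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
reading = Reading("sensor-1", 21.5, event_time)

# Processing time: stamped by the processing component on arrival.
# Here it is simulated as 42 seconds later (an invented delay).
processing_time = event_time + timedelta(seconds=42)

lag = lag_seconds(reading, processing_time)
print(f"lag between event time and processing time: {lag} s")
```

In a real pipeline the processing time would come from the clock of the processing component rather than being computed from the event time as it is here.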

In stream-processing applications, the lag between event time and processing time varies from event to event, and this variation leads to late or out-of-order data. The delay has several causes, for example:

  • Network latencies
  • Variance in data load
  • Batching of events
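A minimal simulation shows how a varying lag reorders events. The event names, event times, and per-event delays are invented for illustration. The rule used to flag an event as out of order (its event time is older than the newest event time already seen) is one simple convention, not the definition used by any specific engine.

```python
# Each tuple is (name, event_time_s, delay_s); all values are made up.
# Arrival (processing) time is event time plus that event's delay.
events = [
    ("a", 0, 1),
    ("b", 1, 5),  # slow network path
    ("c", 2, 1),
    ("d", 3, 4),  # held back in a batch
    ("e", 4, 1),
]

# The processor sees events in order of arrival, not in order of occurrence.
arrived = sorted(events, key=lambda e: e[1] + e[2])
arrival_order = [name for name, _, _ in arrived]

# Flag events whose event time is older than the newest one seen so far.
out_of_order = []
max_event_time_seen = float("-inf")
for name, event_time, _ in arrived:
    if event_time < max_event_time_seen:
        out_of_order.append(name)
    else:
        max_event_time_seen = event_time

print("arrival order:", arrival_order)
print("out of order: ", out_of_order)
```

With these delays, events b and d reach the processor after events that happened later than they did, so any computation keyed on arrival order would see them in the wrong position.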

In Spark Streaming based on DStream, it is not easy to incorporate event time. 
