Streaming is not an island

Streaming applications never stand alone; they are part of a larger ecosystem. First, a streaming application may ingest data from complex data sources, both relational and non-relational. The data can arrive in various formats, such as JSON and Avro, and at various levels of normalization. On top of that, every real data source contains a lot of dirty data, which can lead to a long cycle of data cleaning and data preparation.
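As a rough illustration of the cleaning step, the following is a minimal sketch in Python, assuming records arrive as JSON lines. The field names (user_id, amount) and the function names are hypothetical and chosen only for this example; a real pipeline would apply its own schema and would usually route rejected records to a side channel rather than drop them.

```python
import json
from typing import Iterable, Iterator, Optional

# Hypothetical schema: the fields every clean record must carry.
REQUIRED_FIELDS = ("user_id", "amount")


def clean_record(raw: str) -> Optional[dict]:
    """Parse one JSON line; return a normalized dict, or None if it is dirty."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(record, dict):
        return None  # valid JSON, but not an object
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None  # a required field is missing or null
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None  # amount is not numeric
    return {"user_id": str(record["user_id"]).strip(), "amount": amount}


def clean_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield only the records that survive cleaning."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            yield cleaned
```

Because clean_stream is a generator, records are cleaned one at a time as they arrive, which suits an unbounded source better than loading a batch into memory first.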

Streaming applications may also interface with interactive analytics applications or machine-learning applications. Each of these pairings brings its own throughput and scalability challenges, which need to be resolved differently in each case. A few tips can help with this optimization:

  • Data temperature awareness: Not all data is created equal, and not all data needs to be treated the same way. Data has a temperature (how recent it is) and an importance, and the two are correlated. One example is online machine learning for fraud detection, where the latency service level agreement (SLA) is measured in milliseconds. You may also want to update your ML model in real time, based on the data you are receiving.
  • Avoiding persisting on disk: Persisting or checkpointing data to disk is an expensive operation, and it is one of the factors that killed MapReduce. Maximize the use of memory.
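
The two tips above can be combined in a small sketch: classify each event by its age, and keep only the hot events in an in-memory window. This is an illustration under stated assumptions, not a prescribed design. The thresholds, the temperature function, and the HotWindow class are all hypothetical, and the window assumes events arrive roughly in time order.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical thresholds, in seconds; real values depend on the SLA.
HOT_SECONDS = 60.0
WARM_SECONDS = 3600.0


def temperature(event_time: float, now: float) -> str:
    """Classify an event by age: the more recent, the hotter."""
    age = now - event_time
    if age <= HOT_SECONDS:
        return "hot"
    if age <= WARM_SECONDS:
        return "warm"
    return "cold"


class HotWindow:
    """Hold recent events in memory and evict them once they cool down.

    Assumes events are added in non-decreasing event_time order.
    """

    def __init__(self, window_seconds: float = HOT_SECONDS) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, dict]] = deque()

    def add(self, event_time: float, event: dict, now: float) -> List[Tuple[float, dict]]:
        """Store an event and return whatever aged out of the window."""
        self._events.append((event_time, event))
        return self.evict(now)

    def evict(self, now: float) -> List[Tuple[float, dict]]:
        """Remove and return events older than the window."""
        evicted = []
        while self._events and now - self._events[0][0] > self.window_seconds:
            evicted.append(self._events.popleft())
        return evicted

    def __len__(self) -> int:
        return len(self._events)
```

The evicted events are returned rather than discarded, so the caller can decide whether cooler data is worth the cost of writing to slower storage.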