Why do we need Spark Streaming?

As noted by Tathagata Das – committer and member of the project management committee (PMC) of the Apache Spark project and lead developer of Spark Streaming – in the Datanami article Spark Streaming: What is It and Who's Using it (https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/), there is a business need for streaming. With the prevalence of online transactions and social media, as well as sensors and devices, companies are generating and processing more data at a faster rate.

The ability to develop actionable insight at scale and in real time provides those businesses with a competitive advantage. Whether you are detecting fraudulent transactions, providing real-time detection of sensor anomalies, or reacting to the next viral tweet, streaming analytics is becoming increasingly important in data scientists' and data engineers' toolboxes.

Spark Streaming is itself being rapidly adopted because Apache Spark unifies these disparate data processing paradigms (machine learning via ML and MLlib, Spark SQL, and Streaming) within the same framework. You can go from training machine learning models (ML or MLlib), to scoring data with these models (Streaming), to performing analysis using your favourite BI tool (SQL), all within the same framework. Companies including Uber, Netflix, and Pinterest often showcase their Spark Streaming use cases.

Currently, there are four broad use cases surrounding Spark Streaming:

  • Streaming ETL: Data is continuously being cleansed and aggregated prior to being pushed downstream. This is commonly done to reduce the amount of data to be stored in the final data store.
  • Triggers: Real-time detection of behavioral or anomalous events triggers immediate and downstream actions. For example, a device that comes within proximity of a detector or beacon will trigger an alert.
  • Data enrichment: Real-time data is joined to other datasets, allowing for richer analysis. For example, real-time weather information can be combined with flight information to build better travel alerts.
  • Complex sessions and continuous learning: Multiple sets of events associated with real-time streams are continuously analyzed and/or used to update machine learning models. For example, the stream of user activity associated with an online game allows us to better segment the users.
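
To make the first three use cases more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Spark Streaming API; it only treats a stream as a sequence of small batches, in the spirit of Spark Streaming's micro-batch model. All names (WEATHER, DELAY_ALERT_MINUTES, process_batch, and the record fields) are hypothetical and chosen purely for illustration, building on the weather and flight example above.

```python
from collections import defaultdict

# Hypothetical static lookup table used for enrichment (airport -> weather).
WEATHER = {"SEA": "rain", "SFO": "fog", "JFK": "clear"}

# Hypothetical threshold: delays above this many minutes raise an alert.
DELAY_ALERT_MINUTES = 60


def process_batch(batch):
    """Process one micro-batch of flight records.

    Returns (aggregates, alerts):
      aggregates: average delay per airport (streaming ETL)
      alerts: enriched records whose delay exceeds the threshold (triggers)
    """
    totals = defaultdict(lambda: [0, 0])  # airport -> [sum of delays, count]
    alerts = []
    for record in batch:
        # Streaming ETL: cleanse by dropping records with a missing delay.
        if record.get("delay") is None:
            continue
        # Data enrichment: join the live record to the static weather table.
        enriched = dict(record, weather=WEATHER.get(record["airport"], "unknown"))
        totals[enriched["airport"]][0] += enriched["delay"]
        totals[enriched["airport"]][1] += 1
        # Trigger: flag records that need immediate action.
        if enriched["delay"] > DELAY_ALERT_MINUTES:
            alerts.append(enriched)
    # Streaming ETL: aggregate so less data is pushed downstream.
    aggregates = {airport: s / n for airport, (s, n) in totals.items()}
    return aggregates, alerts


# Two micro-batches standing in for a continuous stream.
stream = [
    [
        {"airport": "SEA", "delay": 30},
        {"airport": "SEA", "delay": 90},
        {"airport": "SFO", "delay": None},
    ],
    [{"airport": "JFK", "delay": 10}],
]

results = [process_batch(batch) for batch in stream]
for aggregates, alerts in results:
    print(aggregates, alerts)
```

In a real Spark Streaming job, the per-batch logic would be expressed as transformations on the stream rather than a Python loop, but the division of work (cleanse and aggregate, enrich by joining, then trigger) would follow the same shape.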