Chapter 4. Learning from Data Using Spark

Having laid the foundation for harvesting data in the previous chapter, we are now ready to learn from that data. Machine learning is about drawing insights from data. Our objective is to give an overview of Spark MLlib (short for Machine Learning library) and to apply the appropriate algorithms to our dataset in order to derive insights. From the Twitter dataset, we will apply an unsupervised clustering algorithm to distinguish Apache Spark-relevant tweets from the rest. Our initial input is a mixed bag of tweets. We first preprocess the data to extract the relevant features, then apply the machine learning algorithm to our dataset, and finally evaluate the results and the performance of our model.

In this chapter, we will cover the following points:

  • Providing an overview of the Spark MLlib module, its algorithms, and the typical machine learning workflow.
  • Preprocessing the harvested Twitter dataset to extract the relevant features, applying an unsupervised clustering algorithm to identify Apache Spark-relevant tweets, and then evaluating the model and the results obtained.
  • Describing the Spark machine learning pipeline.
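
To make the preprocess, cluster, and evaluate steps concrete before we turn to Spark, here is a minimal plain-Python sketch of the same workflow. It is not the Spark MLlib API: the four sample tweets, the helper names (featurize, kmeans, wssse), the 32-bucket hashed bag-of-words features, and the choice of k-means with k = 2 are all assumptions made for illustration.

```python
import hashlib
import random
import re

# Hypothetical sample tweets, invented for illustration only.
tweets = [
    "Apache Spark MLlib makes clustering easy",
    "Learning Spark SQL and MLlib today",
    "Great coffee and sunshine this morning",
    "Coffee with friends this morning",
]


def featurize(text, dim=32):
    """Preprocess: lowercase, tokenize, and hash each token into one of `dim` buckets."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9#@]+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iterations=20, seed=42):
    """Cluster: plain k-means with centers initialized from k random points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers


def wssse(points, labels, centers):
    """Evaluate: within-cluster sum of squared errors (lower means tighter clusters)."""
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))


points = [featurize(t) for t in tweets]
labels, centers = kmeans(points, k=2)
for tweet, label in zip(tweets, labels):
    print(label, tweet)
print("WSSSE:", wssse(points, labels, centers))
```

The within-cluster sum of squared errors computed at the end gives one simple way to compare runs with different values of k; which tweets land in which cluster depends on the features and on the random initialization.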

Contextualizing Spark MLlib in the app architecture

Let's first place the focus of this chapter within the data-intensive app architecture. We will concentrate on the analytics layer and, more precisely, on machine learning. This will serve as a foundation for streaming apps, because we want to apply what we learn from batch processing of the data as inference rules for streaming analysis.

The following diagram sets the context of the chapter's focus, highlighting the machine learning module within the analytics layer while using tools for exploratory data analysis, Spark SQL, and Pandas.

[Figure: Contextualizing Spark MLlib in the app architecture]