As we have laid the foundation for data to be harvested in the previous chapter, we are now ready to learn from the data. Machine learning is about drawing insights from data. Our objective is to give an overview of the Spark MLlib (short for Machine Learning library) and apply the appropriate algorithms to our dataset in order to derive insights. From the Twitter dataset, we will be applying an unsupervised clustering algorithm in order to distinguish between Apache Spark-relevant tweets versus the rest. We have as initial input a mixed bag of tweets. We first need to preprocess the data in order to extract the relevant features, then apply the machine learning algorithm to our dataset, and finally evaluate the results and the performance of our model.
In this chapter, we will cover the following points:
Let's first contextualize the focus of this chapter on data-intensive app architecture. We will concentrate our attention on the analytics layer and more precisely machine learning. This will serve as a foundation for streaming apps as we want to apply the learning from the batch processing of data as inference rules for the streaming analysis.
The following diagram sets the context of the chapter's focus, highlighting the machine learning module within the analytics layer while using tools for exploratory data analysis, Spark SQL, and Pandas.