Chapter 9.  Advanced Machine Learning with Streaming and Graph Data

This chapter shows how to apply machine learning techniques to streaming and graph data using Spark MLlib, Spark ML, Spark Streaming, and GraphX; for example, topic modeling on real-time tweet data from Twitter. Readers will learn how to use the available APIs to build real-time, predictive applications from streaming data sources such as Twitter. Through Twitter data analysis, we will show how to perform large-scale social sentiment analysis. We will also show how to develop a large-scale movie recommendation engine using Spark MLlib, which is an implicit part of social network analysis. In a nutshell, the following topics will be covered throughout this chapter:

  • Developing real-time ML pipelines
  • Time series and social network analysis
  • Movie recommendation using Spark
  • Developing a real-time ML pipeline from streaming data
  • ML pipeline on graph data and semi-supervised graph-based learning

However, what a real-time ML application really needs in order to be effective is a continuous flow of labeled data. Consequently, preprocessing large-scale unstructured data and labeling it accurately introduces many unwanted latencies.

Nowadays, we hear and read a lot about real-time machine learning. People usually bring up this appealing business scenario when discussing sentiment analysis from Social Network Services (SNS), credit card fraud detection systems, or mining customer purchase rules from business-oriented transactional data.

According to many ML experts, it is possible to continuously update a credit card fraud detection model in real time. That sounds fantastic, but it is not entirely realistic, for several reasons. Firstly, ensuring a continuous flow of this kind of labeled data for model retraining is difficult. Secondly, creating labeled data is probably the slowest and most expensive step in most machine learning systems.

Developing real-time ML pipelines

In order to develop a real-time machine learning application, we need access to a continuous flow of data. The data might include transactional records, plain text, tweets from Twitter, or messages streamed from Flume or Kafka; most of this is unstructured data.

To deploy these kinds of ML applications, we need to go through a series of steps. The real-time data arriving from several sources is often the most unreliable part of the pipeline, and the network itself is frequently a performance bottleneck.

For example, it is not guaranteed that you will always receive a steady stream of tweets from Twitter. Moreover, labeling this data on the fly in order to build an ML model is not a realistic idea. Nevertheless, here we provide a realistic view of how an ML pipeline could be developed and deployed from real-time streaming data. Figure 1 shows the workflow of real-time ML application development.

Streaming data collection as unstructured text data

We would like to stress here that real-time stream data collection depends on the following:

  • The purpose of the data collection. If the purpose is online credit card fraud detection, the data should be collected from your own transaction network through a web API. If the purpose is social media sentiment analysis, the data could be collected from Twitter, LinkedIn, Facebook, or news sites (a minimal collection sketch for the Twitter case follows this list). If the purpose is network anomaly detection, the data could be collected from network traffic.
  • Data availability is an issue, since not all social media platforms provide public APIs for collecting data. The network condition also matters, since stream data is large and needs very fast connectivity.
  • Storage capacity is an important consideration, since even a few minutes of collected tweets, for example, can amount to several GB of data.
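As a minimal sketch of the collection step, the following snippet receives a filtered tweet stream with Spark Streaming and persists the raw text in micro-batches. It assumes the spark-streaming-twitter connector is on the classpath and that the Twitter OAuth credentials are supplied as twitter4j system properties; the keywords, batch interval, and output path are placeholders, not prescriptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TweetCollector {
  def main(args: Array[String]): Unit = {
    // Twitter OAuth credentials are assumed to be supplied as twitter4j system
    // properties, e.g. -Dtwitter4j.oauth.consumerKey=... on the command line.
    val conf = new SparkConf().setAppName("TweetCollector").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Receive a stream of statuses, optionally filtered by keywords
    val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "bigdata"))

    // Keep only the raw text and persist each micro-batch; a few minutes of tweets
    // can already amount to GBs, so point this at HDFS or S3 in practice
    tweets.map(_.getText).saveAsTextFiles("hdfs:///tmp/tweets/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}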

Moreover, for fraud detection we typically have to wait at least a couple of days before a transaction can be marked as Fraud or Not Fraud. In contrast, if somebody reports a fraudulent transaction, we can label that transaction as Fraud immediately, for the sake of simplicity.

Labeling the data for supervised machine learning

The labeled dataset plays a central role in the whole process. With a fixed labeled dataset, it is very easy to change the parameters of an algorithm, such as the feature normalization or the loss function. We also have several options when choosing the algorithm itself, from logistic regression to Support Vector Machines (SVMs) or random forests, for example.

However, we cannot change the labeled dataset itself: the labels are predefined, and the model is expected to predict the labels we already have. In previous chapters, we have shown that labeling even structured data takes a considerable amount of time.

Now think about the fully unstructured stream data that we receive from streaming or real-time sources; labeling it takes even longer. On top of that, we also have to preprocess the unstructured data: tokenization, cleaning, indexing, removing stop words, and removing special characters.
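As a minimal sketch of this preprocessing step with Spark ML, assuming the collected text has already been loaded into a DataFrame with a text column, we can chain a RegexTokenizer (which also drops punctuation and special characters by splitting on non-word characters) with a StopWordsRemover; the toy input rows are placeholders:

import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Preprocessing").getOrCreate()
import spark.implicits._

// A tiny stand-in for the collected, still unlabeled tweet text
val raw = Seq("Spark Streaming is great!!!", "I really don't like the new UI...").toDF("text")

// Split on non-word characters, which also removes punctuation and special characters
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")

// Remove common English stop words from the token lists
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

val cleaned = remover.transform(tokenizer.transform(raw))
cleaned.select("filtered").show(truncate = false)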

Essentially, this raises the question of how long the data labeling process takes. A final point about the labeled dataset is that it can end up biased if the labeling is not done carefully, which can seriously degrade the model's performance.

Creating and building the model

To train a sentiment analysis model, a credit card fraud detection model, or an association rule mining model, we need a large number of examples that are labeled as accurately as possible. Once we have the labeled dataset, we are ready to train and build the model, as illustrated in Figure 1.

Figure 1: Real-time machine learning workflow.

In Chapter 7, Tuning Machine Learning Models, we discussed how to choose appropriate models and ML algorithms in order to produce better predictive analytics. The model can be a binary or multiclass classifier with several classes; alternatively, an LDA model can be used for sentiment analysis through topic modeling. In a nutshell, Figure 1 shows the real-time machine learning workflow.
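As a minimal sketch of the model-building step, assuming the labeled text is available as a DataFrame named labeled with text and label columns (the variable and column names are assumptions), a binary sentiment classifier could be assembled as a Spark ML Pipeline that reuses the preprocessing stages above, adds term-frequency hashing, and fits a logistic regression model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer, StopWordsRemover}

// `labeled` is assumed to be a DataFrame with a "text" column and a numeric "label"
// column (for example 1.0 = positive, 0.0 = negative) produced by the labeling step.
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words").setPattern("\\W+")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("features").setNumFeatures(10000)
val lr        = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, lr))

// Hold out a test set so that the model evaluation step has unseen data to work with
val Array(train, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 12345L)
val model = pipeline.fit(train)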

Real-time predictive analytics

When your ML model has been properly trained and built, it is ready for real-time predictive analytics. If the model gives good predictions, that is fantastic. However, as we mentioned previously when discussing accuracy measures such as true positives and false positives, a high number of false positives means that the performance of the model is not satisfactory.

This essentially means one of three things: we have not labeled the stream dataset properly, in which case we iterate step two (labeling the data in Figure 1); we have not selected the proper ML algorithm to train the model; or we have not tuned the model enough to find appropriate hyperparameters or perform model selection. If none of these is the case, we can go straight to step seven (model deployment in Figure 1).
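Before moving on to tuning, here is a minimal sketch of what the real-time scoring step itself might look like, assuming tweets is the DStream of tweet text from the collection sketch and model is the fitted PipelineModel from the training sketch (both names are assumptions): each micro-batch is converted to a DataFrame and passed through the same pipeline.

import org.apache.spark.sql.SparkSession

// `tweets` is assumed to be the DStream[String] of tweet text collected earlier,
// and `model` the fitted PipelineModel from the training sketch above.
tweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Turn the micro-batch into a DataFrame with the column name the pipeline expects
    val batch = rdd.toDF("text")

    // Score the batch; the "prediction" column is added by the logistic regression stage
    val scored = model.transform(batch).select("text", "prediction")
    scored.show(5, truncate = false) // or write the results to a sink for downstream consumers
  }
}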

Tuning the ML model for improvement and model evaluation

As mentioned in step four (model evaluation in Figure 1), if the performance of the model is not satisfactory or convincing enough, we need to tune it. In Chapter 7, Tuning Machine Learning Models, we learned how to choose the appropriate model and ML algorithms in order to produce better predictive analytics. There are several techniques for tuning a model's performance, and we can apply them depending on the requirements and the situation. When the tuning is done, we should finally evaluate the model again.
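As a sketch of such tuning with Spark ML's built-in tools, and assuming the pipeline, hashingTF, lr, train, and test values from the earlier sketches, a CrossValidator can search a small hyperparameter grid, after which the evaluator measures the tuned model on the held-out data:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid over two knobs of the pipeline sketched earlier: the feature space size
// and the regularization strength of the logistic regression stage
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000, 100000))
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .build()

val evaluator = new BinaryClassificationEvaluator() // area under ROC by default

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(train)                        // best model found over the grid
val auc = evaluator.evaluate(cvModel.transform(test))
println(s"Area under ROC on the held-out set: $auc")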

Model adaptability and deployment

Once we have tuned the model and found the best one, it has to be prepared to learn incrementally over new data, so that the model is updated each time it sees a new training instance. When the model is ready to make accurate and reliable predictions on large-scale streaming data, we can deploy it in real life.
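For the incremental-learning part, Spark MLlib ships streaming variants of some algorithms. The sketch below uses StreamingLogisticRegressionWithSGD, assuming trainingStream and testStream are DStream[LabeledPoint]s derived from newly labeled micro-batches and ssc is the active StreamingContext (all three names are assumptions):

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// `trainingStream` and `testStream` are assumed to be DStream[LabeledPoint]s built
// from newly labeled micro-batches (for example, parsed with LabeledPoint.parse).
val numFeatures = 10000

val streamingModel = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.5)
  .setNumIterations(10)

// The model weights are updated on every micro-batch of labeled data it sees
streamingModel.trainOn(trainingStream)

// Meanwhile, predictions are emitted for the incoming validation stream
streamingModel.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()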
