Chapter 6. Building Scalable Machine Learning Pipelines

The ultimate goal of machine learning is to make a machine that can automatically build models from data, without requiring tedious and time-consuming human involvement. Therefore, this chapter guides readers through creating some practical and widely used machine learning pipelines and applications using Spark MLlib and Spark ML. Both APIs will be described in detail, and a baseline use case will be covered for each. Then we will focus on scaling up an ML application so that it can cope with increasing data loads. After reading this chapter, readers will be able to differentiate between the two APIs and select the one that best fits their requirements. In a nutshell, the following topics will be covered throughout this chapter:

  • Spark machine learning pipeline APIs
  • Cancer-diagnosis pipeline with Spark
  • Cancer-prognosis pipeline with Spark
  • Market basket analysis with Spark Core
  • OCR pipeline with Spark
  • Topic modeling using Spark MLlib and ML
  • Credit-risk-analysis pipeline with Spark
  • Scaling the ML pipelines
  • Tips and performance considerations

Spark machine learning pipeline APIs

MLlib's goal is to make practical machine learning (ML) scalable and easy. Spark introduces the pipeline API for the easy creation and tuning of practical ML pipelines. As discussed in Chapter 4, Extracting Knowledge through Feature Engineering, a practical ML pipeline involves a sequence of data collection, pre-processing, feature extraction, feature selection, model fitting, validation, and model evaluation stages. For example, classifying text documents might involve text segmentation and cleaning, feature extraction, and training a classification model with cross-validation for tuning. Most ML libraries are not designed for distributed computation, nor do they provide native support for pipeline creation and tuning.

Dataset abstraction

As described in Chapter 1, Introduction to Data Analytics with Spark, when running SQL from within another programming language, the results are returned as a DataFrame. A DataFrame is a distributed collection of data organized into named columns. A Dataset, on the other hand, is an interface that aims to provide the benefits of RDDs, such as strong typing, on top of Spark SQL's optimized execution engine.

A Dataset can be constructed from JVM objects and manipulated in both Scala and Java. In the Spark pipeline design, a dataset is represented by Spark SQL's Dataset. An ML pipeline involves a sequence of Dataset transformations and models: each transformation takes an input dataset and outputs the transformed dataset, which becomes the input to the next stage.
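
For example, a minimal sketch of constructing a Dataset from JVM objects in Scala might look like the following (assuming a Spark 2.x SparkSession named spark; the Person case class is purely illustrative):

    // Hypothetical case class representing the JVM objects that back the Dataset
    case class Person(name: String, age: Long)

    // spark is an existing SparkSession; its implicits provide the toDS() encoder
    import spark.implicits._

    val ds = Seq(Person("Andy", 32), Person("Berta", 29)).toDS()
    ds.show()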

Consequently, data import and export are the start and end points of an ML pipeline. To make these easier, Spark MLlib and Spark ML provide utilities for importing and exporting Datasets, DataFrames, RDDs, and models for several application-specific types, including the following (a brief LabeledPoint sketch appears after the list):

  • LabeledPoint for classification and regression
  • LabeledDocument for cross-validation and Latent Dirichlet Allocation (LDA)
  • Rating and ranking for collaborative filtering
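
For instance, a LabeledPoint simply pairs a label with a feature vector. A minimal sketch in Scala, with made-up feature values, looks like this:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // A positive example backed by a dense feature vector
    val pos = LabeledPoint(1.0, Vectors.dense(4.9, 3.0, 1.4))

    // A negative example backed by a sparse vector: 3 features in total,
    // with non-zero values only at indices 0 and 2
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(4.6, 1.5)))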

However, real datasets usually contain many more types, such as user IDs, item IDs, labels, timestamps, and raw records.

Unfortunately, the current Spark utilities cannot easily handle datasets consisting of these types, especially time-series datasets. If you recall the section Machine learning pipeline - an overview in Chapter 4, Extracting Knowledge through Feature Engineering, feature transformation usually forms the majority of a practical ML pipeline. A feature transformation can be viewed as appending a new column created from existing columns, or dropping existing columns.

In Figure 1, Text processing for machine learning model, you can see that the text tokenizer breaks a document into a bag of words. After that, the TF-IDF algorithm converts the bag of words into a feature vector. During these transformations, the labels need to be preserved for the model-fitting stage:

Figure 1: Text processing for machine learning model (DS indicates data sources)

If you recall Figure 5 and Figure 6 from Chapter 4, Extracting Knowledge through Feature Engineering, the ID, text, and words columns are carried over during the transformation steps. They are useful in making predictions and in model inspection, but they are actually unnecessary for model fitting. According to a Databricks blog post on ML pipelines at https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html, a prediction dataset provides little information if it contains only the predicted labels.

Consequently, if you want to inspect prediction metrics, such as accuracy, precision, recall, weighted true positives, and weighted false positives, it is quite useful to look at the predicted labels along with the raw input text and tokenized words. The same recommendation applies to other machine learning applications using Spark ML and Spark MLlib.
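
As an illustration, one of these metrics, accuracy, can be computed from the prediction column with Spark ML's MulticlassClassificationEvaluator (available in Spark 2.x). The following sketch assumes a hypothetical predictions DataFrame produced by a fitted model:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // predictions is assumed to hold "label" and "prediction" columns,
    // alongside the raw text and tokenized words carried over for inspection
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracy = evaluator.evaluate(predictions)
    println(s"Test set accuracy = $accuracy")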

Therefore, easy conversion between RDDs, Datasets, and DataFrames has been made possible for in-memory, disk-based, or external data sources such as Hive and Avro. Although creating new columns from existing columns is easy with user-defined functions, the materialization of a Dataset is a lazy operation.
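
For example, appending a column with a user-defined function could be sketched as follows (df is a hypothetical DataFrame with a text column; as with all DataFrame transformations, the new column is only computed when an action forces evaluation):

    import org.apache.spark.sql.functions.udf

    // Hypothetical UDF deriving a new column from the existing "text" column
    val wordCount = udf { (text: String) => text.split("\\s+").length }

    // withColumn appends the derived column; evaluation is lazy
    val augmented = df.withColumn("wordCount", wordCount(df("text")))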

In contrast, a Dataset supports only certain standard data types. However, to increase usability and to make it a better fit for machine learning models, Spark has also added support for the Vector type as a user-defined type that handles both dense and sparse feature vectors, through the mllib.linalg.DenseVector and mllib.linalg.SparseVector implementations of mllib.linalg.Vector.

Tip

Complete DataFrame, Dataset, and RDD examples in Java, Scala, and Python can be found in the examples/src/main/ folder of the Spark distribution. Interested readers can refer to Spark SQL's user guide at http://spark.apache.org/docs/latest/sql-programming-guide.html to learn more about DataFrame, Dataset, and the operations they support.

Pipeline

Spark provides the pipeline API under Spark ML. As previously stated, a pipeline comprises a sequence of stages, and there are two basic types of pipeline stage: Transformer and Estimator.

A transformer takes a dataset as input and produces an augmented dataset as output, so that the output can be fed to the next step. For example, Tokenizer and HashingTF are two transformers: Tokenizer transforms a dataset with text into a dataset with tokenized words, while HashingTF produces term frequencies. Tokenization and HashingTF are commonly used in text mining and text analytics.
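
A minimal sketch of declaring these two transformers in Scala (the column names and the number of features are illustrative) is shown here:

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Splits the "text" column into a "words" column of tokens
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Hashes each bag of words into a fixed-length term-frequency vector
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
      .setNumFeatures(1000)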

On the contrary, an estimator must first be fitted on the input dataset to produce a model. In this case, the model itself will be used as the transformer for transforming the input dataset into the augmented output dataset. For example, logistic regression or linear regression can be used as an estimator after fitting the training dataset with its corresponding labels and features.

After fitting, it produces a logistic or linear regression model. This implies that developing a pipeline is easy and simple: all you need to do is declare the required stages, configure the related stages' parameters, and finally chain them in a pipeline object, as shown in Figure 2:

Figure 2: Spark ML pipeline model using logistic regression estimator (DS indicates data store and the steps inside the dashed line only happen during pipeline fitting)

If you look at Figure 2, the fitted model consists of a tokenizer, a HashingTF feature extractor, and a fitted logistic regression model. The fitted pipeline model acts as a transformer that can be used for prediction, model validation, model inspection, and finally, model deployment. However, to increase performance in terms of prediction accuracy, the model itself needs to be tuned. We will discuss how to tune a machine learning model in Chapter 7, Tuning Machine Learning Model.
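
Putting these pieces together, a minimal sketch of chaining the stages and fitting the pipeline (reusing the hypothetical tokenizer and hashingTF stages declared earlier, and assuming trainingDF and testDF are DataFrames with text and label columns) might look like this:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression

    // LogisticRegression is an Estimator; fitting it yields a transformer model
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // Chain the transformers and the estimator into a single pipeline
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the transformers and fits the estimator, returning a PipelineModel
    val model = pipeline.fit(trainingDF)

    // The fitted PipelineModel is itself a Transformer, usable for prediction
    val predictions = model.transform(testDF)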

To show the pipelining technique more practically, the following section shows how to create a practical pipeline for cancer diagnosis using Spark ML and MLlib.
