Getting ready

Let's first understand some of the basic concepts in Spark ML. Before that, let's quickly go over how the learning process works. The steps are as follows:

  1. A machine learning algorithm is provided a training dataset along with the right hyperparameters. 
  2. The result of training is a model. The following figure illustrates how a model is built by applying a machine learning algorithm to training data with hyperparameters: 

  3. The model is then used to make predictions on test data, as shown here:

In Spark ML, an Estimator is provided a DataFrame (via the fit method), and the output after training is a Transformer:
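Here is a minimal sketch of this in code. The SparkSession setup and the data values are illustrative assumptions, not part of this recipe; note that Spark ML's LogisticRegression expects label and features columns by default:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("estimator-sketch").getOrCreate()
import spark.implicits._

// Hypothetical training data: (label, features), where the features are
// height (inches) and weight (lbs), and 1.0 means basketball player.
val trainingData = Seq(
  (1.0, Vectors.dense(80.0, 250.0)),
  (0.0, Vectors.dense(65.0, 150.0))
).toDF("label", "features")

// LogisticRegression is an Estimator; fit() trains it on the DataFrame.
val lr = new LogisticRegression()

// The output of training is a Transformer (a LogisticRegressionModel).
val model = lr.fit(trainingData)
```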

Now, the Transformer takes one DataFrame as input and outputs another, transformed DataFrame (via the transform method). For example, it can take a DataFrame containing the test data, enrich it with an additional column for predictions, and output it, as shown here: 
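Continuing the sketch above, the model's transform method appends prediction columns to a test DataFrame (the test rows here are made-up values):

```scala
// Hypothetical test data with the same schema as the training data.
val testData = Seq(
  (1.0, Vectors.dense(78.0, 230.0)),
  (0.0, Vectors.dense(62.0, 130.0))
).toDF("label", "features")

// transform() returns a new DataFrame enriched with prediction columns.
val predictions = model.transform(testData)
predictions.select("features", "label", "prediction").show()
```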

Transformers are not limited to making predictions using models; they can also be used for feature transformation. An easy way to understand feature transformation is to compare it to the map function on RDDs. 
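As a concrete example of a feature-transforming Transformer, Spark ML's VectorAssembler maps each row's raw numeric columns into a single vector column, much like a map over an RDD. A minimal sketch, with column names assumed purely for illustration:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical raw data with separate numeric columns.
val raw = Seq((80.0, 250.0), (65.0, 150.0)).toDF("height", "weight")

// VectorAssembler is a Transformer: no training is involved; it simply
// maps the input columns of each row into one "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("height", "weight"))
  .setOutputCol("features")

val assembled = assembler.transform(raw)
```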

A machine learning pipeline is defined as a sequence of stages; each stage can be either an Estimator or a Transformer.

The example we are going to use in this recipe is predicting whether or not someone is a basketball player. For this, we are going to have a pipeline of one Estimator and one Transformer.

The Estimator is given the training data to train the algorithm, and then the Transformer makes predictions.
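A minimal sketch of how such a pipeline could be wired up, reusing the assembler and lr stages from the snippets above (the data values remain illustrative assumptions):

```scala
import org.apache.spark.ml.Pipeline

// Hypothetical labelled raw data: height, weight, and whether the
// person is a basketball player (1.0) or not (0.0).
val labeledRaw = Seq(
  (80.0, 250.0, 1.0),
  (65.0, 150.0, 0.0)
).toDF("height", "weight", "label")

// A Pipeline is itself an Estimator whose stages run in sequence: here
// a feature-transforming Transformer followed by an Estimator.
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// fit() returns a PipelineModel, which is a Transformer.
val pipelineModel = pipeline.fit(labeledRaw)

// transform() runs every stage and appends the prediction columns.
val scored = pipelineModel.transform(labeledRaw)
scored.select("height", "weight", "label", "prediction").show()
```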

Please note that both Transformer.transform() and Estimator.fit() are stateless. 

For now, assume LogisticRegression to be the machine learning algorithm we are using. We will explain the details about LogisticRegression along with other algorithms in the subsequent chapters.
