The pipeline

With the three datasets created, the next step is to build the Apache Spark ML pipeline and train the model. The steps are as follows:

First, you need to import the Apache Spark ML packages that will be needed in the subsequent steps:

from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

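Next, the example uses StringIndexer to convert each string column into a column of numeric indices. The label indexer is fitted to df_data right away so that its labels are available later, when the predictions are converted back into strings: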

stringIndexer_label = StringIndexer(inputCol="PRODUCT_LINE", outputCol="label").fit(df_data)
stringIndexer_prof = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_IX")
stringIndexer_gend = StringIndexer(inputCol="GENDER", outputCol="GENDER_IX")
stringIndexer_mar = StringIndexer(inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_IX")
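
To make the indexing step concrete, here is a minimal plain-Python sketch of the mapping that StringIndexer produces under its default ordering, where the most frequent value receives index 0.0. It does not use Spark and is not Spark's implementation; the helper names fit_string_index and transform_string_index and the sample values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Return the distinct labels ordered by descending frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))

def transform_string_index(values, labels):
    """Replace each string with the position of its label, as a float."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]

# Hypothetical sample of a MARITAL_STATUS column.
marital_status = ["Married", "Single", "Married", "Unspecified", "Married", "Single"]

labels = fit_string_index(marital_status)
indexed = transform_string_index(marital_status, labels)

print(labels)   # labels in index order
print(indexed)  # one index per input row
```

Splitting the work into a fit step and a transform step mirrors the way the label indexer above is fitted first and then reused.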

In the following step, the example creates a feature vector by combining all features together:

vectorAssembler_features = VectorAssembler(inputCols=["GENDER_IX", "AGE", "MARITAL_STATUS_IX", "PROFESSION_IX"], outputCol="features")
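
As a rough illustration of what the assembler does to a single row, the following plain-Python sketch concatenates the listed columns, in the order given, into one list of floats. It is not Spark code; the assemble helper and the sample row are hypothetical.

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row after the string columns have been indexed.
row = {"GENDER_IX": 1.0, "AGE": 27, "MARITAL_STATUS_IX": 0.0, "PROFESSION_IX": 3.0}
input_cols = ["GENDER_IX", "AGE", "MARITAL_STATUS_IX", "PROFESSION_IX"]

features = assemble(row, input_cols)
print(features)
```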

Next, define the estimator to use for classification. This example uses a random forest:

rf = RandomForestClassifier(labelCol="label", featuresCol="features")

Finally, define a transformer that converts the predicted label indices back into the original label strings:

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)
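
The reverse mapping is simply a lookup into the list of labels that the fitted label indexer holds. The plain-Python sketch below shows the idea without Spark; the index_to_string helper, the label values, and the predictions are hypothetical.

```python
def index_to_string(predictions, labels):
    """Map each predicted index back to the label stored at that position."""
    return [labels[int(p)] for p in predictions]

# Hypothetical label order, as a fitted label indexer might expose it.
labels = ["Camping Equipment", "Personal Accessories", "Golf Equipment"]

# Hypothetical model output: one predicted index per row.
predictions = [2.0, 0.0, 0.0, 1.0]

predicted_labels = index_to_string(predictions, labels)
print(predicted_labels)
```

This is why the label indexer has to be fitted before the converter is defined: the converter needs the label list up front.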

Now the actual pipeline is built:

pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, rf, labelConverter])
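
Notice that the stage list mixes an already fitted model (stringIndexer_label) with stages that have not been fitted yet. The plain-Python sketch below illustrates, in simplified form, how a pipeline can handle both when it is fitted: stages that have a fit method are fitted on the current data, stages that do not are used as they are, and each stage's output feeds the next one. The Indexer, IndexerModel, and fit_pipeline names and the sample rows are hypothetical, and this is not Spark's implementation.

```python
from collections import Counter

class IndexerModel:
    """Transformer: maps strings to indices using a fixed label order."""
    def __init__(self, input_col, output_col, labels):
        self.input_col, self.output_col, self.labels = input_col, output_col, labels

    def transform(self, rows):
        lookup = {label: float(i) for i, label in enumerate(self.labels)}
        return [{**row, self.output_col: lookup[row[self.input_col]]} for row in rows]

class Indexer:
    """Estimator: learns the label order from data and returns a model."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        labels = sorted(counts, key=lambda v: (-counts[v], v))
        return IndexerModel(self.input_col, self.output_col, labels)

def fit_pipeline(stages, rows):
    """Fit each unfitted stage in order, passing transformed rows along."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):      # estimator: fit it on the current rows
            stage = stage.fit(rows)
        fitted.append(stage)           # already fitted stages are kept as-is
        rows = stage.transform(rows)   # output becomes the next stage's input
    return fitted, rows

# Hypothetical sample rows.
rows = [
    {"PRODUCT_LINE": "Golf Equipment", "GENDER": "M"},
    {"PRODUCT_LINE": "Camping Equipment", "GENDER": "F"},
    {"PRODUCT_LINE": "Camping Equipment", "GENDER": "M"},
]

label_model = Indexer("PRODUCT_LINE", "label").fit(rows)   # fitted in advance
gender_indexer = Indexer("GENDER", "GENDER_IX")            # fitted by the pipeline

fitted_stages, transformed = fit_pipeline([label_model, gender_indexer], rows)
print(transformed[0])
```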

At this point in the example, you are ready to train the random forest model by fitting the pipeline you have just built to the training data.
