Here, after we have created our three datasets, the Apache Spark ML pipeline will be created, and the model is trained by performing the following steps:
- First, you need to import the Apache Spark ML packages that will be needed in the subsequent steps:
frompyspark.ml.featureimportOneHotEncoder,StringIndexer,IndexToString,VectorAssembler
frompyspark.ml.classification importRandomForestClassifier
frompyspark.ml.evaluationimportMulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
- Next, the example uses the StringIndexer function as a transformer to convert all of the string columns into numeric ones:
stringIndexer_label=StringIndexer(inputCol="PRODUCT_LINE",outputCol="label").fit(df_data)
stringIndexer_prof=StringIndexer(inputCol="PROFESSION",outputCol="PROFESSION_IX")
stringIndexer_gend=StringIndexer(inputCol="GENDER",outputCol="GENDER_IX")
stringIndexer_mar = StringIndexer(inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_IX")
- In the following step, the example creates a feature vector by combining all features together:
vectorAssembler_features = VectorAssembler(inputCols=["GENDER_IX", "AGE", "MARITAL_STATUS_IX", "PROFESSION_IX"], outputCol="features")
- Next, the estimators that you want to use for classification are defined (random forest is used):
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
- And finally, convert the indexed labels back into original labels:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)
- Now the actual pipeline is built:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, rf, labelConverter])
At this point in the example, you are ready to train the random forest model by using the pipeline and training data you have just built.