Using random forest for ethnicity prediction

In the previous section, we saw how to use H2O for ethnicity prediction. However, we could not achieve satisfactory prediction accuracy. Moreover, H2O was not yet mature enough to compute all the necessary performance metrics.

So why don't we try a Spark-based tree-ensemble technique such as random forest (RF) or gradient-boosted trees (GBTs)? Since we have seen that RF shows better predictive accuracy in most cases, let's start with that one.

In the K-means section, we already prepared a Spark DataFrame named schemaDF, so we can simply transform the variables into the feature vectors described before. For this, however, we need to exclude the label column (the first column, Region), which we can do with the drop() method on the column array as follows:

import org.apache.spark.ml.feature.VectorAssembler

val featureCols = schemaDF.columns.drop(1) // all columns except the first (the label)
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val assembleDF = assembler.transform(schemaDF).select("features", "Region")
assembleDF.show()
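Conceptually, the VectorAssembler just concatenates the selected numeric columns into a single feature vector per row, leaving the label alongside it. A minimal plain-Scala sketch of that idea (the row type, column names, and values here are made up for illustration):

```scala
// Hypothetical rows: a Region label plus three numeric feature columns.
case class GenomeRow(region: String, v1: Double, v2: Double, v3: Double)

val rows = Seq(
  GenomeRow("EUR", 0.1, 0.2, 0.3),
  GenomeRow("AFR", 0.4, 0.5, 0.6)
)

// Assemble the numeric columns into one vector per row, keeping the label.
val assembled: Seq[(Array[Double], String)] =
  rows.map(r => (Array(r.v1, r.v2, r.v3), r.region))
```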

At this point, you could further reduce the dimensionality and extract the principal components using PCA or another feature selection algorithm; however, I will leave that up to you. Since Spark expects the label column to be numeric, we have to convert the ethnic group names into numbers. We can use StringIndexer() for this, which is straightforward:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("Region")
  .setOutputCol("label")

val indexedDF = indexer.fit(assembleDF)
  .transform(assembleDF)
  .select("features", "label")
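By default, StringIndexer orders labels by descending frequency, so the most frequent class receives index 0.0 (ties are broken alphabetically). A small plain-Scala sketch of that mapping, using made-up region labels:

```scala
// Made-up region labels; index by descending frequency, ties alphabetical.
val regions = Seq("EUR", "AFR", "EUR", "EAS", "AFR", "EUR")

val labelToIndex: Map[String, Double] =
  regions
    .groupBy(identity)                                      // label -> occurrences
    .toSeq
    .sortBy { case (label, occ) => (-occ.size, label) }     // most frequent first
    .zipWithIndex
    .map { case ((label, _), idx) => label -> idx.toDouble }
    .toMap

val indexed = regions.map(labelToIndex) // the numeric "label" column
```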

Then we randomly split the dataset into training and test sets. In our case, let's use 75% for training and the rest for testing:

val seed = 12345L
val splits = indexedDF.randomSplit(Array(0.75, 0.25), seed)
val (trainDF, testDF) = (splits(0), splits(1))
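Note that randomSplit assigns each row independently at random, so the resulting sizes are only approximately 75%/25%; the seed makes the split reproducible. A rough plain-Scala analogue of this behavior (not Spark's actual sampling code):

```scala
import scala.util.Random

// Each "row" is kept for training with probability 0.75; a fixed seed
// makes the assignment reproducible across runs.
val seed = 12345L
val rows = (1 to 1000).toVector
val rng = new Random(seed)
val (trainRows, testRows) = rows.partition(_ => rng.nextDouble() < 0.75)
```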

Since this is a small dataset, we can cache both the training and test sets for faster access:

trainDF.cache
testDF.cache

Then we instantiate the random forest classifier:

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(1234567L)

Now let's create a paramGrid to search through the random forest's hyperparameters (maxDepth, featureSubsetStrategy, impurity, maxBins, and numTrees) for the best model:

import org.apache.spark.ml.tuning.ParamGridBuilder

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, 3 :: 5 :: 15 :: 20 :: 25 :: 30 :: Nil)
  .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: Nil)
  .addGrid(rf.impurity, "gini" :: "entropy" :: Nil)
  .addGrid(rf.maxBins, 3 :: 5 :: 10 :: 15 :: 25 :: 35 :: 45 :: Nil)
  .addGrid(rf.numTrees, 5 :: 10 :: 15 :: 20 :: 30 :: Nil)
  .build()
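Keep in mind that ParamGridBuilder takes the cross-product of all listed values: the grid above yields 6 × 2 × 2 × 7 × 5 = 840 candidate parameter combinations, and each one will be trained once per fold. The multiplicative growth is easy to verify in plain Scala:

```scala
// The same value lists as in the grid above; every combination is one candidate.
val maxDepths  = List(3, 5, 15, 20, 25, 30)
val strategies = List("auto", "all")
val impurities = List("gini", "entropy")
val binCounts  = List(3, 5, 10, 15, 25, 35, 45)
val treeCounts = List(5, 10, 15, 20, 30)

// Cross-product: 6 * 2 * 2 * 7 * 5 = 840 distinct combinations.
val grid = for {
  d <- maxDepths; s <- strategies; i <- impurities
  b <- binCounts; t <- treeCounts
} yield (d, s, i, b, t)
```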

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

Then we set up 10-fold cross-validation to get an optimized and stable model; this also reduces the chance of overfitting:

import org.apache.spark.ml.tuning.CrossValidator

val numFolds = 10
val crossval = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
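For each candidate, k-fold cross-validation partitions the training rows into k disjoint folds, trains on k - 1 of them, validates on the held-out fold, and averages the k scores. The fold bookkeeping can be sketched in plain Scala as:

```scala
// Assign 100 row indices round-robin to 10 folds; each fold is held out
// once for validation while the other nine form the training set.
val k = 10
val rowIds = (0 until 100).toVector
val folds: Map[Int, Vector[Int]] = rowIds.groupBy(_ % k)

val cvSplits = (0 until k).map { f =>
  val validation = folds(f)
  val training   = rowIds.filterNot(validation.toSet)
  (training, validation)
}
```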

Well, now we are ready for training. So let's train the random forest model, searching for the best hyperparameter settings:

val cvModel = crossval.fit(trainDF)

Now that we have the cross-validated best model, let's evaluate it on the test set. First, we compute the prediction DataFrame for each test instance. Then we use MulticlassClassificationEvaluator() to evaluate the performance, since this is a multiclass classification problem.

Additionally, we compute performance metrics such as accuracy, precision, recall, and the F1 measure. Note that for this multiclass problem, the evaluator gives us the weightedPrecision and weightedRecall:

val predictions = cvModel.transform(testDF)
predictions.show(10)
>>>
Figure 21: Raw prediction probability, true label, and the predicted label using random forest
// Note: setMetricName mutates the evaluator in place, so each
// metric needs its own evaluator instance:
def newEvaluator() = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

val evaluator1 = newEvaluator().setMetricName("accuracy")
val evaluator2 = newEvaluator().setMetricName("weightedPrecision")
val evaluator3 = newEvaluator().setMetricName("weightedRecall")
val evaluator4 = newEvaluator().setMetricName("f1")

Now let's compute the classification accuracy, precision, recall, F1 measure, and error on the test data:

val accuracy = evaluator1.evaluate(predictions)
val precision = evaluator2.evaluate(predictions)
val recall = evaluator3.evaluate(predictions)
val f1 = evaluator4.evaluate(predictions)

Finally, we print the performance metrics:

println("Accuracy = " + accuracy)
println("Precision = " + precision)
println("Recall = " + recall)
println("F1 = " + f1)
println(s"Test Error = ${1 - accuracy}")
>>>
Accuracy = 0.7196470196470195
Precision = 0.7196470196470195
Recall = 0.7196470196470195
F1 = 0.7196470196470195
Test Error = 0.28035298035298046
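For reference, all of these weighted metrics can be derived from the test set's confusion matrix: each per-class precision or recall is weighted by that class's share of the true labels. (Incidentally, weighted recall always reduces to plain accuracy, which is why those two figures agree.) A plain-Scala sketch using a made-up 3-class confusion matrix, not the actual results:

```scala
// Made-up confusion matrix: rows = true class, columns = predicted class.
val cm = Vector(
  Vector(50, 5, 5),
  Vector(10, 20, 0),
  Vector(0, 0, 10)
)
val numClasses = cm.size
val total = cm.flatten.sum.toDouble

// Accuracy: correctly classified instances over all instances.
val accuracy = (0 until numClasses).map(i => cm(i)(i)).sum / total

// Column sums = how often each class was predicted.
def predicted(j: Int) = cm.map(_(j)).sum.toDouble

// Weight each per-class metric by the class's share of true labels.
val weightedRecall = (0 until numClasses).map { i =>
  val support = cm(i).sum.toDouble
  (support / total) * (cm(i)(i) / support)
}.sum

val weightedPrecision = (0 until numClasses).map { i =>
  val support = cm(i).sum.toDouble
  val p = if (predicted(i) == 0) 0.0 else cm(i)(i) / predicted(i)
  (support / total) * p
}.sum
```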

Yes, it turns out to be a better performer. This is a bit unexpected, since we hoped for better predictive accuracy from the DL model but did not get it. As I already stated, we can still try other H2O parameters. At any rate, we can now see around a 25% improvement using random forest, and it can probably still be improved further.
