- Start the Spark shell:
$ spark-shell
- Perform the required imports:
scala> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
- Load and parse the data:
scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/rf")
- Split the data into training and test datasets:
scala> val Array(training, test) = data.randomSplit(Array(0.7, 0.3))
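The split is random, so the numbers you see will vary from run to run. To make the split reproducible, randomSplit also accepts a seed (the value 42 below is arbitrary):
scala> val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42)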
- Create a random forest classifier with three trees (random forest also supports regression):
scala> val rf = new RandomForestClassifier().setNumTrees(3)
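Here we only set the number of trees. As a sketch, other commonly tuned knobs on RandomForestClassifier include the maximum tree depth and how many features each split considers; the values below are illustrative, not tuned:
scala> val rf = new RandomForestClassifier().setNumTrees(3).setMaxDepth(5).setFeatureSubsetStrategy("auto")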
- Train the model:
scala> val model = rf.fit(training)
- Evaluate the model on the test instances and compute the accuracy:
scala> val predictions = model.transform(test)
scala> val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
scala> val accuracy = evaluator.evaluate(predictions)
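Besides the single accuracy number, it helps to eyeball a few individual predictions and to derive the test error as the complement of the accuracy:
scala> predictions.select("label", "prediction").show(5)
scala> val testError = 1.0 - accuracy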
- Check the model:
scala> model.toDebugString
"RandomForestClassificationModel (uid=rfc_ac46ea5af585) with 3 trees
  Tree 0 (weight 1.0):
    If (feature 1 <= 0.0)
     Predict: 0.0
    Else (feature 1 > 0.0)
     Predict: 1.0
  Tree 1 (weight 1.0):
    If (feature 5 <= 0.0)
     Predict: 1.0
    Else (feature 5 > 0.0)
     Predict: 0.0
  Tree 2 (weight 1.0):
    If (feature 5 <= 0.0)
     Predict: 1.0
    Else (feature 5 > 0.0)
     Predict: 0.0
"
- We used toy data to illustrate the value of random forests. Now let's run the same exercise on the diabetes data: replace step 3 with the following and run steps 4 to 7 again:
scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/patientdata")
Now the accuracy has reached 74.6 percent.
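Once you are happy with the accuracy, you may want to persist the trained model for later reuse. A minimal sketch, assuming you have write access to the target directory (the local path below is hypothetical):
scala> model.write.overwrite().save("/tmp/rf-model")
scala> val sameModel = RandomForestClassificationModel.load("/tmp/rf-model")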