How to do it...

  1. Start the Spark shell:
         $ spark-shell
  1. Perform the required imports:
        scala> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
        scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  1. Load and parse the data:
        scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/rf")
  1. Split the data into training and test datasets:
        scala> val Array(training, test) = data.randomSplit(Array(0.7, 0.3))
  1. Create a random forest classifier with three trees (random forest also supports regression):
        scala> val rf = new RandomForestClassifier().setNumTrees(3)
  1. Train the model:
        scala> val model = rf.fit(training)
  1. Evaluate the model on the test instances and compute the accuracy (the test error is 1 minus this value):
        scala> val predictions = model.transform(test)
        scala> val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
        scala> val accuracy = evaluator.evaluate(predictions)
  1. Check the model:
        scala> model.toDebugString
        "RandomForestClassificationModel (uid=rfc_ac46ea5af585) with 3 trees
          Tree 0 (weight 1.0):
            If (feature 1 <= 0.0)
             Predict: 0.0
            Else (feature 1 > 0.0)
             Predict: 1.0
          Tree 1 (weight 1.0):
            If (feature 5 <= 0.0)
             Predict: 1.0
            Else (feature 5 > 0.0)
             Predict: 0.0
          Tree 2 (weight 1.0):
            If (feature 5 <= 0.0)
             Predict: 1.0
            Else (feature 5 > 0.0)
             Predict: 0.0
        "
  1. The toy data illustrates how random forest works. Now repeat the exercise on the diabetes data by replacing step 3 with the following and rerunning steps 4 to 7:
        scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/patientdata")

The accuracy now reaches 74.6 percent. Because randomSplit is called without a seed here, the exact figure varies from run to run.
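
For convenience, the steps above can be collected into a single script, sketched below. It is a minimal sketch rather than the recipe's definitive form: the application name and the seed value 42 are assumptions added here for illustration and are not part of the original steps. Fixing the seed on both the split and the classifier is one way to make the reported accuracy repeatable between runs.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// Inside spark-shell, getOrCreate() returns the session the shell already created.
// The application name is an arbitrary, illustrative choice.
val spark = SparkSession.builder().appName("RandomForestRecipe").getOrCreate()

// Same path as step 3; use s3a://sparkcookbook/patientdata for the diabetes run.
val data = spark.read.format("libsvm").load("s3a://sparkcookbook/rf")

// Assumption: a fixed seed (42 is arbitrary) so the 70/30 split is repeatable.
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// Three trees, as in the recipe; the classifier seed is also an added assumption.
val rf = new RandomForestClassifier().setNumTrees(3).setSeed(42L)

val model = rf.fit(training)
val predictions = model.transform(test)

// Accuracy is the fraction of correct predictions; test error is its complement.
val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Accuracy = $accuracy, test error = ${1.0 - accuracy}")
```

The statements can be pasted into the Spark shell as they are. With a different seed, or with no seed at all, the split changes and so does the accuracy, so a single figure should be read as one sample rather than a fixed property of the model.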
