There's more...

Let's apply the decision tree to the diabetes dataset we worked on in the previous recipe:

  1. Start the Spark shell or the Databricks Cloud shell and do the necessary imports:
        $ spark-shell
        scala> import org.apache.spark.ml.classification.DecisionTreeClassifier
        scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  2. Read the diabetes data as a DataFrame:
        scala> val data = spark.read.format("libsvm").option("inferschema","true").load("s3a://sparkcookbook/medicaldata/diabetes.libsvm")
  3. Split it into training and test datasets:
        scala> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
  4. Initialize the decision tree classifier:
        scala> val dt = new DecisionTreeClassifier()
  5. Train the model using the training data:
        scala> val model = dt.fit(trainingData)
  6. Do predictions on the test dataset:
        scala> val predictions = model.transform(testData)
  7. Initialize the evaluator:
        scala> val evaluator = new BinaryClassificationEvaluator()
  8. Evaluate the predictions:
        scala> val auroc = evaluator.evaluate(predictions)
  9. Print the area under the curve:
        scala> println(s"Area under ROC = $auroc")
Area under ROC = 0.7624556737588652
We used the decision tree classifier here without tweaking any hyperparameters and got an area under the ROC curve of about 0.76. Why don't you tweak the hyperparameters yourself and see whether you can improve it even further?
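As a starting point, here is one possible sketch that reuses the dt, evaluator, trainingData, and testData values from the preceding steps. It shows two common approaches: setting hyperparameters such as maxDepth and maxBins directly on the classifier, and searching a small parameter grid with cross-validation. The specific values chosen here are arbitrary examples, not recommended settings:

        scala> import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
        scala> // Approach 1: set hyperparameters directly on the classifier
        scala> val tunedDt = new DecisionTreeClassifier().setMaxDepth(10).setMaxBins(64)
        scala> val tunedModel = tunedDt.fit(trainingData)
        scala> evaluator.evaluate(tunedModel.transform(testData))
        scala> // Approach 2: search a small grid with 3-fold cross-validation
        scala> val paramGrid = new ParamGridBuilder().addGrid(dt.maxDepth, Array(3, 5, 10)).addGrid(dt.impurity, Array("gini", "entropy")).build()
        scala> val cv = new CrossValidator().setEstimator(dt).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
        scala> val cvModel = cv.fit(trainingData)
        scala> evaluator.evaluate(cvModel.transform(testData))

Cross-validation picks the parameter combination with the best average evaluator score across the folds, so it takes longer to run than fitting a single model but gives a more reliable comparison of the candidate settings.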