Let's apply the decision tree to the diabetes dataset we worked on in the previous recipe:
- Start the Spark shell or the Databricks Cloud shell and do the necessary imports:
$ spark-shell
scala> import org.apache.spark.ml.classification.DecisionTreeClassifier
scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
- Read the diabetes data as a DataFrame:
scala> val data = spark.read.format("libsvm")
.load("s3a://sparkcookbook/medicaldata/diabetes.libsvm")
- Split it into training and test datasets:
scala> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
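Note that `randomSplit` shuffles the data, so each run produces a slightly different split. If you want a reproducible split (handy when comparing runs later), you can pass an explicit seed; the value 42 here is an arbitrary choice:

```scala
// Reproducible 70/30 split: fixing the seed (42 is arbitrary) makes
// repeated runs produce the same training/test partitions.
val Array(trainingData, testData) =
  data.randomSplit(Array(0.7, 0.3), seed = 42L)
```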
- Initialize the decision tree classifier:
scala> val dt = new DecisionTreeClassifier()
- Train the model using the training data:
scala> val model = dt.fit(trainingData)
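The fitted model can be inspected directly: `toDebugString` prints the learned split rules, which is a quick sanity check on what the tree actually learned:

```scala
// Inspect the trained tree: its depth, node count, and split rules.
println(s"Depth: ${model.depth}, nodes: ${model.numNodes}")
println(model.toDebugString)
```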
- Do predictions on the test dataset:
scala> val predictions = model.transform(testData)
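The `transform` call appends `rawPrediction`, `probability`, and `prediction` columns (the ML pipeline defaults) to the test DataFrame. A quick look at a few rows shows how the predictions line up with the labels:

```scala
// Compare predicted classes and their probabilities with the true labels.
predictions.select("label", "prediction", "probability").show(5)
```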
- Initialize the evaluator:
scala> val evaluator = new BinaryClassificationEvaluator()
- Evaluate the predictions:
scala> val auroc = evaluator.evaluate(predictions)
- Print the area under the curve:
scala> println(s"Area under ROC = $auroc")
Area under ROC = 0.7624556737588652
We used the decision tree classifier here without tuning any hyperparameters and still obtained an area under the ROC curve of about 0.76. Try tuning the hyperparameters yourself and see whether you can improve it further.
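As a starting point for that exercise, here is one possible sketch using Spark's built-in tuning API: a small grid over `maxDepth` and `maxBins`, evaluated by five-fold cross-validation with the same AUROC evaluator. The grid values are arbitrary illustrations, not recommendations:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate hyperparameter values to try; purely illustrative.
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 7))
  .addGrid(dt.maxBins, Array(16, 32, 64))
  .build()

// Five-fold cross-validation, scoring each combination by area under ROC.
val cv = new CrossValidator()
  .setEstimator(dt)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(trainingData)
val tunedAuroc = evaluator.evaluate(cvModel.transform(testData))
println(s"Tuned area under ROC = $tunedAuroc")
```

Cross-validation fits each of the nine parameter combinations five times, so expect this to take considerably longer than a single `fit`.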