Let's apply the decision tree to the diabetes dataset we worked on in the previous recipe:
- Start the Spark shell or the Databricks Cloud shell and do the necessary imports:
$ spark-shell
scala> import org.apache.spark.ml.classification.DecisionTreeClassifier
scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
- Read the diabetes data as a DataFrame:
scala> val data = spark.read.format("libsvm")
.load("s3a://sparkcookbook/medicaldata/diabetes.libsvm")
- Split it into training and test datasets:
scala> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
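Note that `randomSplit` shuffles the data, so each run produces a slightly different split. If you want a reproducible split (handy when comparing runs later), you can pass an explicit seed; the value 42 here is an arbitrary choice:

```scala
// Reproducible 70/30 split: fixing the seed (42 is arbitrary) makes
// repeated runs produce the same training/test partitions.
val Array(trainingData, testData) =
  data.randomSplit(Array(0.7, 0.3), seed = 42L)
```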
- Initialize the decision tree classifier:
scala> val dt = new DecisionTreeClassifier()
- Train the model using the training data:
scala> val model = dt.fit(trainingData)
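The fitted model can be inspected directly: `toDebugString` prints the learned split rules, which is a quick sanity check on what the tree actually learned:

```scala
// Inspect the trained tree: its depth, node count, and split rules.
println(s"Depth: ${model.depth}, nodes: ${model.numNodes}")
println(model.toDebugString)
```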
- Do predictions on the test dataset:
scala> val predictions = model.transform(testData)
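The `transform` call appends `rawPrediction`, `probability`, and `prediction` columns (the ML pipeline defaults) to the test DataFrame. A quick look at a few rows shows how the predictions line up with the labels:

```scala
// Compare predicted classes and their probabilities with the true labels.
predictions.select("label", "prediction", "probability").show(5)
```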
- Initialize the evaluator:
scala> val evaluator = new BinaryClassificationEvaluator()
- Evaluate the predictions:
scala> val auroc = evaluator.evaluate(predictions)
- Print the area under the curve:
scala> println(s"Area under ROC = $auroc")
Area under ROC = 0.7624556737588652
We used the decision tree classifier here without tuning any hyperparameters and still obtained an area under the ROC curve of about 0.76. Try tuning the hyperparameters yourself and see whether you can improve it further.
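As a starting point for that exercise, here is one possible sketch using Spark's built-in tuning API: a small grid over `maxDepth` and `maxBins`, evaluated by five-fold cross-validation with the same AUROC evaluator. The grid values are arbitrary illustrations, not recommendations:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate hyperparameter values to try; purely illustrative.
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 7))
  .addGrid(dt.maxBins, Array(16, 32, 64))
  .build()

// Five-fold cross-validation, scoring each combination by area under ROC.
val cv = new CrossValidator()
  .setEstimator(dt)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(trainingData)
val tunedAuroc = evaluator.evaluate(cvModel.transform(testData))
println(s"Tuned area under ROC = $tunedAuroc")
```

Cross-validation fits each of the nine parameter combinations five times, so expect this to take considerably longer than a single `fit`.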