- Start the Spark shell:
$ spark-shell
- Do the required imports:
scala> import org.apache.spark.mllib.util.MLUtils
scala> import org.apache.spark.mllib.classification.SVMWithSGD
- Load the data as an RDD:
scala> val data = MLUtils.loadLibSVMFile(sc, "s3a://sparkcookbook/medicaldata/diabetes.libsvm")
- Count the number of records:
scala> data.count
- Split the dataset randomly into training and test data (note that randomSplit produces approximately, not exactly, equal halves):
scala> val trainingAndTest = data.randomSplit(Array(0.5,0.5))
- Assign the training and test data:
scala> val trainingData = trainingAndTest(0)
scala> val testData = trainingAndTest(1)
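- As an aside (not part of the original steps), randomSplit also accepts an optional seed, which makes the split reproducible across shell sessions; a pattern match unpacks the returned array in one line:
scala> val Array(trainingData2, testData2) = data.randomSplit(Array(0.5, 0.5), seed = 11L)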
- Train the algorithm and build the model with 100 iterations (you can try different iteration counts; past a certain point you'll see the results start to converge, and that point is a good number of iterations to choose):
scala> val model = SVMWithSGD.train(trainingData,100)
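- One way to find that convergence point is to sweep a few iteration counts and watch the error level off. This is a rough sketch, not part of the original recipe; it reuses trainingData from the steps above and uses the raw mismatch count on the training set as a simple stand-in for a proper evaluation metric:
scala> for (iters <- Seq(10, 50, 100, 200)) {
     |   // Train a model at this iteration count and count training-set mismatches
     |   val m = SVMWithSGD.train(trainingData, iters)
     |   val errors = trainingData.filter(r => m.predict(r.features) != r.label).count
     |   println(s"iterations=$iters training errors=$errors")
     | }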
- Now you can use this model to predict the label for any data point. Predict the label for the first point in the test data:
scala> val label = model.predict(testData.first.features)
- For each test record, create a tuple whose first value is the model's prediction and whose second value is the actual label; this will help us compute the accuracy of our algorithm:
scala> val predictionsAndLabels = testData.map(r => (model.predict(r.features), r.label))
- Count how many records the model mispredicted, that is, where the prediction differs from the actual label:
scala> predictionsAndLabels.filter(p => p._1 != p._2).count
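- To turn that mismatch count into an accuracy figure, one possible follow-up (assuming the same shell session, with testData and predictionsAndLabels still defined) is:
scala> val total = testData.count
scala> val mismatches = predictionsAndLabels.filter(p => p._1 != p._2).count
scala> val accuracy = 1.0 - mismatches.toDouble / total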