How to do it...

  1. Start the Spark shell:
        $ spark-shell
  1. Do the required imports:
        scala> import org.apache.spark.mllib.util.MLUtils
        scala> import org.apache.spark.mllib.classification.SVMWithSGD
  1. Load the data as an RDD:
        scala> val data = MLUtils.loadLibSVMFile(sc,
          "s3a://sparkcookbook/medicaldata/diabetes.libsvm")
  1. Count the number of records:
        scala> data.count
  1. Divide the dataset into equal halves of training data and testing data:
        scala> val trainingAndTest = data.randomSplit(Array(0.5,0.5))
  1. Assign the training and test data:
        scala> val trainingData = trainingAndTest(0)
        scala> val testData = trainingAndTest(1)
  1. Train the algorithm and build the model with 100 iterations (you can try different iteration counts; at a certain point of inflection the results start to converge, and that point of inflection is the right number of iterations to choose):
        scala> val model = SVMWithSGD.train(trainingData,100)
  1. Now you can use this model to predict a label for any dataset. Predict the label for the first point in the test data:
        scala> val label = model.predict(testData.first.features)
  1. For each test record, create a tuple whose first value is the prediction and whose second value is the actual label; this will help us compute the accuracy of our algorithm:
        scala> val predictionsAndLabels = testData.map(r =>
          (model.predict(r.features), r.label))
  1. You can count how many records have mismatched predictions and actual labels:
        scala> predictionsAndLabels.filter(p => p._1 != p._2).count
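  1. To turn the mismatch count into an accuracy figure, divide the number of matching predictions by the total number of test records (this is a simple sketch continuing the session above; the `accuracy` value name is illustrative):
        scala> val accuracy = predictionsAndLabels.filter(p =>
          p._1 == p._2).count.toDouble / testData.count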