There's more...

House size is just one predictor variable. A house's price depends on other variables too, such as the lot size, the age of the house, and so on. Generally, the more relevant predictor variables you have, the better your prediction is likely to be.
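
To use multiple predictor variables with Spark ML, the individual feature columns must first be assembled into a single vector column, since LinearRegression expects a features column of type Vector. Here is a minimal sketch using VectorAssembler; the column names (houseSize, lotSize, age, price) and the values are hypothetical, purely for illustration:

      scala> import org.apache.spark.ml.feature.VectorAssembler
      scala> // Hypothetical toy DataFrame; column names and values are made up for illustration
      scala> val houses = Seq((2100.0, 8000.0, 12.0, 450000.0), (1600.0, 6500.0, 30.0, 310000.0)).toDF("houseSize", "lotSize", "age", "price")
      scala> // Pack the predictor columns into the single "features" vector column
      scala> val assembler = new VectorAssembler().setInputCols(Array("houseSize", "lotSize", "age")).setOutputCol("features")
      scala> val assembled = assembler.transform(houses)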

In this recipe, since the dataset is small and we are just getting started, we have checked the prediction for only one vector. In reality, we need to evaluate the model against a held-out test set using a regression metric such as the root mean squared error (RMSE), which is what the RegressionEvaluator we use below reports by default. (ROC curves and the area under them apply to classification problems, not regression.)

Let's do the same exercise with more data, available at http://sparkcookbook.amazonaws.com/housingdata/realestate.libsvm. This dataset contains housing data for 2,800 houses. Let's repeat the exercise with it:

  1. Start the Spark shell:
      $ spark-shell

  2. Do the necessary imports:
      scala> import org.apache.spark.ml.regression.LinearRegression
      scala> import org.apache.spark.ml.evaluation.RegressionEvaluator

  3. Load the data in Spark as a dataset:
      scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/housingdata/realestate.libsvm")

  4. Divide the data into training and test datasets:
      scala> val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

  5. Instantiate the LinearRegression object:
      scala> val lr = new LinearRegression()

  6. Fit/train the model:
      scala> val model = lr.fit(training)

  7. Make predictions on the test dataset:
      scala> val predictions = model.transform(test)

  8. Instantiate RegressionEvaluator:
      scala> val evaluator = new RegressionEvaluator()

  9. Evaluate the predictions:
      scala> evaluator.evaluate(predictions)
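
RegressionEvaluator reports the RMSE by default. If you prefer a different metric, you can request one with setMetricName; Spark ML supports values such as rmse, mse, r2, and mae. A quick sketch, reusing the predictions from step 7:

      scala> // R-squared: the fraction of variance in the label explained by the model
      scala> val r2 = new RegressionEvaluator().setMetricName("r2").evaluate(predictions)
      scala> // Mean absolute error, in the same units as the label
      scala> val mae = new RegressionEvaluator().setMetricName("mae").evaluate(predictions)
      scala> // The fitted model's coefficients and intercept can also be inspected directly
      scala> println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")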
      