There's more...

House size is just one predictor variable. A house's price depends on other variables too, such as the lot size, the age of the house, and so on. Generally, the more relevant predictor variables you have, the better your prediction is likely to be.
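
To use multiple predictor variables with Spark ML, the individual feature columns must first be assembled into a single vector column, since LinearRegression expects a features column of type Vector. Here is a minimal sketch using VectorAssembler; the column names (houseSize, lotSize, age, price) and the values are hypothetical, purely for illustration:

      scala> import org.apache.spark.ml.feature.VectorAssembler
      scala> // Hypothetical toy DataFrame; column names and values are made up for illustration
      scala> val houses = Seq((2100.0, 8000.0, 12.0, 450000.0), (1600.0, 6500.0, 30.0, 310000.0)).toDF("houseSize", "lotSize", "age", "price")
      scala> // Pack the predictor columns into the single "features" vector column
      scala> val assembler = new VectorAssembler().setInputCols(Array("houseSize", "lotSize", "age")).setOutputCol("features")
      scala> val assembled = assembler.transform(houses)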

In this recipe, since the dataset is small and we are just getting started, we have checked the prediction for only one vector. In reality, we need to evaluate the model against a held-out test set using a regression metric such as the root mean squared error (RMSE), which is what the RegressionEvaluator we use below reports by default. (ROC curves and the area under them apply to classification problems, not regression.)

Let's do the same exercise with more data, available at http://sparkcookbook.amazonaws.com/housingdata/realestate.libsvm. This dataset contains housing data for 2,800 houses. Let's repeat the exercise with it:

  1. Start the Spark shell:
      $ spark-shell

  2. Do the necessary imports:
      scala> import org.apache.spark.ml.regression.LinearRegression
      scala> import org.apache.spark.ml.evaluation.RegressionEvaluator

  3. Load the data in Spark as a dataset:
      scala> val data = spark.read.format("libsvm").load("s3a://sparkcookbook/housingdata/realestate.libsvm")

  4. Divide the data into training and test datasets:
      scala> val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

  5. Instantiate the LinearRegression object:
      scala> val lr = new LinearRegression()

  6. Fit/train the model:
      scala> val model = lr.fit(training)

  7. Make predictions on the test dataset:
      scala> val predictions = model.transform(test)

  8. Instantiate RegressionEvaluator:
      scala> val evaluator = new RegressionEvaluator()

  9. Evaluate the predictions:
      scala> evaluator.evaluate(predictions)
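
RegressionEvaluator reports the RMSE by default. If you prefer a different metric, you can request one with setMetricName; Spark ML supports values such as rmse, mse, r2, and mae. A quick sketch, reusing the predictions from step 7:

      scala> // R-squared: the fraction of variance in the label explained by the model
      scala> val r2 = new RegressionEvaluator().setMetricName("r2").evaluate(predictions)
      scala> // Mean absolute error, in the same units as the label
      scala> val mae = new RegressionEvaluator().setMetricName("mae").evaluate(predictions)
      scala> // The fitted model's coefficients and intercept can also be inspected directly
      scala> println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")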
      