- Start the Spark shell:
$ spark-shell
- Import the required classes:
scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.clustering.KMeans
- Create a DataFrame with the features vector:
scala> val data = spark.createDataFrame(Seq(
Vectors.dense(12839,2405),
Vectors.dense(10000,2200),
Vectors.dense(8040,1400),
Vectors.dense(13104,1800),
Vectors.dense(10000,2351),
Vectors.dense(3049,795),
Vectors.dense(38768,2725),
Vectors.dense(16250,2150),
Vectors.dense(43026,2724),
Vectors.dense(44431,2675),
Vectors.dense(40000,2930),
Vectors.dense(1260,870),
Vectors.dense(15000,2210),
Vectors.dense(10032,1145),
Vectors.dense(12420,2419),
Vectors.dense(69696,2750),
Vectors.dense(12600,2035),
Vectors.dense(10240,1150),
Vectors.dense(876,665),
Vectors.dense(8125,1430),
Vectors.dense(11792,1920),
Vectors.dense(1512,1230),
Vectors.dense(1276,975),
Vectors.dense(67518,2400),
Vectors.dense(9810,1725),
Vectors.dense(6324,2300),
Vectors.dense(12510,1700),
Vectors.dense(15616,1915),
Vectors.dense(15476,2278),
Vectors.dense(13390,2497.5),
Vectors.dense(1158,725),
Vectors.dense(2000,870),
Vectors.dense(2614,730),
Vectors.dense(13433,2050),
Vectors.dense(12500,3330),
Vectors.dense(15750,1120),
Vectors.dense(13996,4100),
Vectors.dense(10450,1655),
Vectors.dense(7500,1550),
Vectors.dense(12125,2100),
Vectors.dense(14500,2100),
Vectors.dense(10000,1175),
Vectors.dense(10019,2047.5),
Vectors.dense(48787,3998),
Vectors.dense(53579,2688),
Vectors.dense(10788,2251),
Vectors.dense(11865,1906)
).map(Tuple1.apply)).toDF("features")
- Create a k-means estimator with four clusters and a fixed random seed for reproducibility:
scala> val kmeans = new KMeans().setK(4).setSeed(1L)
- Train the model:
scala> val model = kmeans.fit(data)
- Now compare the cluster assignments made by k-means with the ones we made manually. The k-means algorithm assigns cluster IDs starting from 0. Once you inspect the data, you will find that the mapping between our A-D labels and the k-means cluster IDs is: A=>3, B=>0, C=>2, D=>1.
- Pick some data points from different parts of the chart and predict which cluster each belongs to.
- Look at the data for the 19th house, which has a lot size of 876 sq. ft. and is priced at $665K:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(876,665)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 3
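Under the hood, transform labels each point with the index of the nearest cluster center by Euclidean distance. Here is a minimal pure-Scala sketch of that assignment step; the centers below are made-up illustrative values, not the ones the trained model actually learned (a real KMeansModel exposes those via model.clusterCenters), so the toy indices need not match the Spark model's:

```scala
object NearestCenter {
  type Point = (Double, Double) // (lot size in sq. ft., price in $K)

  // Euclidean distance between two 2-D points
  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // Hypothetical cluster centers, placed roughly where the four
  // groups of houses sit in our data (illustrative only).
  val centers: Vector[Point] =
    Vector((1500.0, 850.0), (13000.0, 2100.0), (45000.0, 2800.0), (10000.0, 1300.0))

  // Label a point with the index of its nearest center; this is
  // what the "prediction" column of the transformed DataFrame holds.
  def assign(p: Point): Int =
    centers.indices.minBy(i => dist(p, centers(i)))
}
```

With these centers, `NearestCenter.assign((876.0, 665.0))` falls into the small-lot, low-price group, just as the trained model put the 19th house into its own small-house cluster.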
- Next, look at the data for the 36th house with a lot size of 15,750 sq. ft. and a price of $1.12 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(15750,1120)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 0
- Now look at the data for the 7th house, which has a lot size of 38,768 sq. ft. and is priced at $2.725 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(38768,2725)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 2
- Finally, look at the data for the 16th house, which has a lot size of 69,696 sq. ft. and is priced at $2.75 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(69696,2750)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 1
You can test the model's predictions with more data. Let's do some neighborhood analysis to see what meaning these clusters carry. Most of the houses in cluster 3 are near downtown, while the cluster 2 houses are on hilly terrain.
In this example, we dealt with a very small set of features; common sense and visual inspection would lead us to the same conclusions. The beauty of the k-means algorithm is that it can cluster data with any number of features. It is a great tool to use when you have raw data and want to discover the patterns hidden in it.
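To make that last point concrete, here is a hedged pure-Scala sketch of one Lloyd iteration, the assign-then-recompute loop at the heart of k-means, written over points of arbitrary dimension to show that nothing in the algorithm is tied to two features. This is an illustrative toy, not Spark's distributed implementation:

```scala
object LloydSketch {
  type Pt = Vector[Double] // a point with any number of features

  // Squared Euclidean distance works unchanged in any dimension
  def sqDist(a: Pt, b: Pt): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 1: label every point with the index of its nearest center
  def assign(points: Seq[Pt], centers: Seq[Pt]): Seq[Int] =
    points.map(p => centers.indices.minBy(i => sqDist(p, centers(i))))

  // Step 2: move each center to the mean of the points assigned to it
  def recompute(points: Seq[Pt], labels: Seq[Int], k: Int): Seq[Pt] =
    (0 until k).map { c =>
      val members = points.zip(labels).collect { case (p, l) if l == c => p }
      if (members.isEmpty) Vector.empty
      else members.transpose.map(_.sum / members.size).toVector
    }
}
```

k-means simply repeats these two steps until the centers stop moving; Spark parallelizes the same computation over a cluster, which is what lets it handle many features and many rows.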