How to do it...

  1. Start the Spark shell:
        $ spark-shell
  1. Do the necessary imports:
        scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.clustering.KMeans
  1. Create a DataFrame with the features vector:
        scala> val data = spark.createDataFrame(Seq(
Vectors.dense(12839,2405),
Vectors.dense(10000,2200),
Vectors.dense(8040,1400),
Vectors.dense(13104,1800),
Vectors.dense(10000,2351),
Vectors.dense(3049,795),
Vectors.dense(38768,2725),
Vectors.dense(16250,2150),
Vectors.dense(43026,2724),
Vectors.dense(44431,2675),
Vectors.dense(40000,2930),
Vectors.dense(1260,870),
Vectors.dense(15000,2210),
Vectors.dense(10032,1145),
Vectors.dense(12420,2419),
Vectors.dense(69696,2750),
Vectors.dense(12600,2035),
Vectors.dense(10240,1150),
Vectors.dense(876,665),
Vectors.dense(8125,1430),
Vectors.dense(11792,1920),
Vectors.dense(1512,1230),
Vectors.dense(1276,975),
Vectors.dense(67518,2400),
Vectors.dense(9810,1725),
Vectors.dense(6324,2300),
Vectors.dense(12510,1700),
Vectors.dense(15616,1915),
Vectors.dense(15476,2278),
Vectors.dense(13390,2497.5),
Vectors.dense(1158,725),
Vectors.dense(2000,870),
Vectors.dense(2614,730),
Vectors.dense(13433,2050),
Vectors.dense(12500,3330),
Vectors.dense(15750,1120),
Vectors.dense(13996,4100),
Vectors.dense(10450,1655),
Vectors.dense(7500,1550),
Vectors.dense(12125,2100),
Vectors.dense(14500,2100),
Vectors.dense(10000,1175),
Vectors.dense(10019,2047.5),
Vectors.dense(48787,3998),
Vectors.dense(53579,2688),
Vectors.dense(10788,2251),
Vectors.dense(11865,1906)
).map(Tuple1.apply)).toDF("features")
  1. Create a k-means estimator for the four clusters:
        scala> val kmeans = new KMeans().setK(4).setSeed(1L)
  1. Train the model:
        scala> val model = kmeans.fit(data)
  1. Now compare the cluster assignments made by k-means with the ones we made manually. The k-means algorithm assigns cluster IDs starting from 0. Once you inspect the data, you will find that the mapping between the A to D labels we assigned and the k-means cluster IDs is: A=>3, B=>0, C=>2, D=>1 (see the inspection sketch below).
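To verify this mapping yourself, the following quick sketch (not part of the original steps; the exact center values depend on the data and the seed) prints the learned cluster centers and the per-row assignments:
        scala> model.clusterCenters.foreach(println)   // one center per learned cluster

scala> model.transform(data).show(false)   // adds a "prediction" column with the cluster ID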
  1. Pick some data points from different parts of the chart and predict which clusters they belong to.
  1. Look at the data for the 19th house, which has a lot size of 876 sq. ft. and is priced at $665K:
        scala> val prediction = model.transform(spark.createDataFrame
(Seq(Vectors.dense(876,665)).map(Tuple1.apply)).toDF("features"))

scala> prediction.first.get(1)
resxx: Any = 3
  1. Next, look at the data for the 36th house with a lot size of 15,750 sq. ft. and a price of $1.12 million:
        scala> val prediction = model.transform(spark.createDataFrame
(Seq(Vectors.dense(15750,1120)).map(Tuple1.apply)).toDF("features"))

scala> prediction.first.get(1)
resxx: Any = 0
  1. Now look at the data for the 7th house, which has a lot size of 38,768 sq. ft. and is priced at $2.725 million:
        scala> val prediction = model.transform(spark.createDataFrame
(Seq(Vectors.dense(38768,2725)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 2
  1. Moving on, look at the data for the 16th house, which has a lot size of 69,696 sq. ft. and is priced at $2.75 million:
        scala>  val prediction = model.transform(spark.createDataFrame
(Seq(Vectors.dense(69696,2750)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 1

Let's do some neighborhood analysis to see what meaning these clusters carry: most of the houses in cluster 3 are near downtown, while the cluster 2 houses are on hilly terrain. You can also test the prediction capability with more data, as sketched below.
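For instance, a minimal batch-prediction sketch (the lot-size/price pairs below are made-up values, not taken from the recipe's dataset) scores several new points in one go:
        scala> val newHouses = spark.createDataFrame(Seq(
Vectors.dense(5000,1000),    // hypothetical small, inexpensive lot
Vectors.dense(50000,2800)    // hypothetical large, expensive lot
).map(Tuple1.apply)).toDF("features")

scala> model.transform(newHouses).select("features", "prediction").show()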

In this example, we dealt with a very small set of features; common sense and visual inspection would lead us to the same conclusions. The beauty of the k-means algorithm is that it can cluster data with any number of features. It is a great tool to use when you have raw data and would like to discover the patterns in it, as the multi-feature sketch below illustrates.
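As a rough sketch of clustering on more than two features (the bedrooms and bathrooms columns and all of their values are invented for illustration; only lot size and price come from this recipe), you can assemble several columns into a feature vector with VectorAssembler and fit k-means on the result:
        scala> import org.apache.spark.ml.feature.VectorAssembler

scala> val raw = spark.createDataFrame(Seq(
(12839.0, 2405.0, 4.0, 3.0),
(10000.0, 2200.0, 3.0, 2.0),
(8040.0, 1400.0, 3.0, 2.0)
)).toDF("lotSize", "price", "bedrooms", "bathrooms")

scala> val assembler = new VectorAssembler().setInputCols(Array("lotSize", "price", "bedrooms", "bathrooms")).setOutputCol("features")

scala> val multiModel = new KMeans().setK(2).setSeed(1L).fit(assembler.transform(raw))   // k=2 because this toy sample has only three rows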
