- Start the Spark shell:
$ spark-shell
- Import the required classes:
scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.clustering.KMeans
- Create a DataFrame with the features vector:
scala> val data = spark.createDataFrame(Seq(
Vectors.dense(12839,2405),
Vectors.dense(10000,2200),
Vectors.dense(8040,1400),
Vectors.dense(13104,1800),
Vectors.dense(10000,2351),
Vectors.dense(3049,795),
Vectors.dense(38768,2725),
Vectors.dense(16250,2150),
Vectors.dense(43026,2724),
Vectors.dense(44431,2675),
Vectors.dense(40000,2930),
Vectors.dense(1260,870),
Vectors.dense(15000,2210),
Vectors.dense(10032,1145),
Vectors.dense(12420,2419),
Vectors.dense(69696,2750),
Vectors.dense(12600,2035),
Vectors.dense(10240,1150),
Vectors.dense(876,665),
Vectors.dense(8125,1430),
Vectors.dense(11792,1920),
Vectors.dense(1512,1230),
Vectors.dense(1276,975),
Vectors.dense(67518,2400),
Vectors.dense(9810,1725),
Vectors.dense(6324,2300),
Vectors.dense(12510,1700),
Vectors.dense(15616,1915),
Vectors.dense(15476,2278),
Vectors.dense(13390,2497.5),
Vectors.dense(1158,725),
Vectors.dense(2000,870),
Vectors.dense(2614,730),
Vectors.dense(13433,2050),
Vectors.dense(12500,3330),
Vectors.dense(15750,1120),
Vectors.dense(13996,4100),
Vectors.dense(10450,1655),
Vectors.dense(7500,1550),
Vectors.dense(12125,2100),
Vectors.dense(14500,2100),
Vectors.dense(10000,1175),
Vectors.dense(10019,2047.5),
Vectors.dense(48787,3998),
Vectors.dense(53579,2688),
Vectors.dense(10788,2251),
Vectors.dense(11865,1906)
).map(Tuple1.apply)).toDF("features")
- Create a k-means estimator with four clusters and a fixed random seed for reproducibility:
scala> val kmeans = new KMeans().setK(4).setSeed(1L)
- Train the model:
scala> val model = kmeans.fit(data)
- Now compare the cluster assignments made by k-means with the ones we made manually. The k-means algorithm assigns cluster IDs starting from 0. Once you inspect the data, you will find that the mapping between our A-D labels and the k-means cluster IDs is: A=>3, B=>0, C=>2, D=>1.
- Pick some data points from different parts of the chart and predict which cluster each belongs to.
- Look at the data for the 19th house, which has a lot size of 876 sq. ft. and is priced at $665K:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(876,665)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 3
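Under the hood, transform labels each point with the index of the nearest cluster center by Euclidean distance. Here is a minimal pure-Scala sketch of that assignment step; the centers below are made-up illustrative values, not the ones the trained model actually learned (a real KMeansModel exposes those via model.clusterCenters), so the toy indices need not match the Spark model's:

```scala
object NearestCenter {
  type Point = (Double, Double) // (lot size in sq. ft., price in $K)

  // Euclidean distance between two 2-D points
  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // Hypothetical cluster centers, placed roughly where the four
  // groups of houses sit in our data (illustrative only).
  val centers: Vector[Point] =
    Vector((1500.0, 850.0), (13000.0, 2100.0), (45000.0, 2800.0), (10000.0, 1300.0))

  // Label a point with the index of its nearest center; this is
  // what the "prediction" column of the transformed DataFrame holds.
  def assign(p: Point): Int =
    centers.indices.minBy(i => dist(p, centers(i)))
}
```

With these centers, `NearestCenter.assign((876.0, 665.0))` falls into the small-lot, low-price group, just as the trained model put the 19th house into its own small-house cluster.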
- Next, look at the data for the 36th house with a lot size of 15,750 sq. ft. and a price of $1.12 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(15750,1120)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 0
- Now look at the data for the 7th house, which has a lot size of 38,768 sq. ft. and is priced at $2.725 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(38768,2725)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 2
- Finally, look at the data for the 16th house, which has a lot size of 69,696 sq. ft. and is priced at $2.75 million:
scala> val prediction = model.transform(spark.createDataFrame(
Seq(Vectors.dense(69696,2750)).map(Tuple1.apply)).toDF("features"))
scala> prediction.first.get(1)
resxx: Any = 1
You can test the model's predictions with more data. Let's do some neighborhood analysis to see what meaning these clusters carry. Most of the houses in cluster 3 are near downtown, while the cluster 2 houses are on hilly terrain.
In this example, we dealt with a very small set of features; common sense and visual inspection would lead us to the same conclusions. The beauty of the k-means algorithm is that it can cluster data with any number of features. It is a great tool to use when you have raw data and want to discover the patterns hidden in it.
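To make that last point concrete, here is a hedged pure-Scala sketch of one Lloyd iteration, the assign-then-recompute loop at the heart of k-means, written over points of arbitrary dimension to show that nothing in the algorithm is tied to two features. This is an illustrative toy, not Spark's distributed implementation:

```scala
object LloydSketch {
  type Pt = Vector[Double] // a point with any number of features

  // Squared Euclidean distance works unchanged in any dimension
  def sqDist(a: Pt, b: Pt): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 1: label every point with the index of its nearest center
  def assign(points: Seq[Pt], centers: Seq[Pt]): Seq[Int] =
    points.map(p => centers.indices.minBy(i => sqDist(p, centers(i))))

  // Step 2: move each center to the mean of the points assigned to it
  def recompute(points: Seq[Pt], labels: Seq[Int], k: Int): Seq[Pt] =
    (0 until k).map { c =>
      val members = points.zip(labels).collect { case (p, l) if l == c => p }
      if (members.isEmpty) Vector.empty
      else members.transpose.map(_.sum / members.size).toVector
    }
}
```

k-means simply repeats these two steps until the centers stop moving; Spark parallelizes the same computation over a cluster, which is what lets it handle many features and many rows.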