k-means clustering

k-means clustering originated in signal processing and is a popular method in data mining. The goal of k-means clustering is to find k points in a dataset that best represent the centers of k regions of that dataset.

k-means clustering is also known as partition clustering, which means that the number of clusters must be specified before the clustering process starts. You can define an objective function as the sum of the squared Euclidean distances between each data point and its nearest cluster centroid. The algorithm then follows a systematic procedure to minimize this objective function iteratively: at each step, it finds a new set of cluster centers that lowers the value of the objective function.

k-means clustering is a popular method in cluster analysis, and it requires few assumptions about the data beyond the number of clusters. Given a dataset and a predetermined number of clusters k, applying the k-means algorithm minimizes the sum-squared error of the distances between the data points and their cluster centroids.
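To make that objective concrete, the sum-squared error for a given set of centroids can be computed directly with NumPy; the points and centroids below are made-up toy values for illustration:

```python
import numpy as np

# Toy data: six 2-D points and two candidate centroids (made-up values).
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = np.array([[1.0, 1.0], [8.5, 8.5]])

# Distance from every point to every centroid, shape (n_points, k).
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point belongs to its nearest centroid ...
nearest = dists.argmin(axis=1)

# ... and the objective is the sum of squared distances to those centroids.
sse = (dists.min(axis=1) ** 2).sum()
```

A new set of centroids is better than the old one exactly when it produces a smaller `sse` for the resulting assignment.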

The algorithm itself is simple:

  • Given are a set of n points (x, y) and an initial set of k centroids
  • For each point (x, y), find the centroid closest to it; this determines the cluster that (x, y) belongs to
  • In each cluster, compute the mean of its points and set this as the new centroid of that cluster; repeat this process until the assignments stop changing
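The steps above can be sketched in plain NumPy (a minimal illustration, not a production implementation; note that standard k-means updates each centroid with the mean of its cluster, and the function name and fixed iteration count here are assumptions of the sketch):

```python
import numpy as np

def kmeans_sketch(points, centroids, n_iter=10):
    """Plain NumPy sketch of the k-means loop described above."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

In practice the loop would stop when the assignments no longer change rather than after a fixed number of iterations.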

Let's take a look at a simple example (this can be applied to a large collection of points) using KMeans from the sklearn.cluster package. This example shows that, with minimal code, you can accomplish k-means clustering using the scikit-learn library:

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import csv

x=[]
y=[]

with open('/Users/myhomedir/cluster_input.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))

data = []
for i in range(len(x)):
    data.append([x[i], y[i]])

plt.figure(figsize=(10,10))

plt.xlim(0,12)
plt.ylim(0,12)

plt.xlabel("X values",fontsize=14)
plt.ylabel("Y values", fontsize=14)

plt.title("Before Clustering ", fontsize=20)

plt.plot(x, y, '.', color='#0080ff', markersize=35, alpha=0.6)

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(data)

plt.figure(figsize=(10,10))

plt.xlabel("X values",fontsize=14)
plt.ylabel("Y values", fontsize=14)

plt.title("After K-Means Clustering (from scikit-learn)", fontsize=20)

plt.plot(x, y, '.', color='#ffaaaa', markersize=45, alpha=0.6)

# Plot the centroids as a blue X
centroids = kmeans.cluster_centers_

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200,
  linewidths=3, color='b', zorder=10)

plt.show()
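Beyond the plot, the fitted KMeans object exposes the results numerically. Here is a self-contained sketch, using synthetic points in place of the CSV file above (the three group centers are made-up values):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the CSV data: three tight groups of 40 points each.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
                  for c in ([2, 2], [6, 2], [4, 8])])

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(data)

labels = kmeans.labels_            # cluster index for every input point
centers = kmeans.cluster_centers_  # the three centroids found
sse = kmeans.inertia_              # the minimized sum-squared error
```

The inertia_ attribute is exactly the sum-squared-error objective discussed earlier, so comparing it across different values of n_clusters is one way to judge a choice of k.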

Plotting the data before clustering looks like this:

[Figure: the data before clustering, followed by the three clusters found after k-means clustering, with centroids marked as blue X's]

In this example, if we instead set k=5 for five clusters, then one cluster remains the same, but the other two clusters are each split in two to obtain five clusters, as shown in the following diagram:

[Figure: the five clusters obtained with k=5]