k-means clustering originated in signal processing and is a popular method in data mining. The main intent of k-means clustering is to find k points (centroids) that best represent the centers of k regions of a dataset.
k-means clustering is also known as partitional clustering. This means that the number of clusters must be specified before the clustering process starts. The objective function is the sum of squared Euclidean distances between each data point and its nearest cluster centroid. The algorithm minimizes this objective iteratively: at each step it computes a new set of cluster centers that lowers the value of the objective function, repeating until no further improvement is possible.
k-means clustering is a popular method in cluster analysis because it is simple and makes few assumptions about the data, although it works best when the clusters are roughly spherical and of similar size. Given a dataset and a predetermined number of clusters k, the k-means algorithm minimizes the sum-squared error of the distances between the points and their assigned cluster centroids.
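To make the objective concrete, here is a small sketch of the quantity k-means minimizes; the function name kmeans_objective and the toy points are illustrative choices, not part of any library:

```python
import numpy as np

def kmeans_objective(points, centroids):
    """Sum of squared Euclidean distances from each point to its
    nearest centroid -- the quantity that k-means minimizes."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes its distance to the *nearest* centroid
    return d2.min(axis=1).sum()

# Three points, two candidate centroids: the two left points each
# sit one unit from [1, 0], and [10, 0] sits on its own centroid.
print(kmeans_objective([[0, 0], [2, 0], [10, 0]], [[1, 0], [10, 0]]))  # 2.0
```

Any other placement of these two centroids gives a larger value, which is why the algorithm keeps moving the centers until this sum stops decreasing.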
The algorithm is pretty simple to understand:
1. Choose k initial centroids (for example, k randomly selected data points).
2. Assign each point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer move.
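The iterative procedure can be sketched from scratch in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm, not the scikit-learn implementation; it ignores edge cases such as a cluster becoming empty:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: alternate between assigning
    points to the nearest centroid and recomputing each centroid as the
    mean of its assigned points. Ignores the empty-cluster edge case."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of three points each
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
centroids, labels = kmeans(pts, k=2)
```

On data like this, the two returned centroids land near (0.33, 0.33) and (10.33, 10.33), and the labels split the points into the two obvious groups.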
Let's take a look at a simple example (this can be applied to a large collection of points) using KMeans from the sklearn.cluster package. This example shows that, with minimal code, you can accomplish k-means clustering using the scikit-learn library:
import csv

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Read the x and y coordinates from the input CSV file
x = []
y = []
with open('/Users/myhomedir/cluster_input.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))

# Pair the coordinates into one [x, y] row per point for KMeans
data = [[x[i], y[i]] for i in range(len(x))]

# Plot the raw points before clustering
plt.figure(figsize=(10, 10))
plt.xlim(0, 12)
plt.ylim(0, 12)
plt.xlabel("X values", fontsize=14)
plt.ylabel("Y values", fontsize=14)
plt.title("Before Clustering", fontsize=20)
plt.plot(x, y, '.', color='#0080ff', markersize=35, alpha=0.6)

# Run k-means with three clusters and k-means++ initialization
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(data)

# Plot the points again, with the fitted centroids on top
plt.figure(figsize=(10, 10))
plt.xlabel("X values", fontsize=14)
plt.ylabel("Y values", fontsize=14)
plt.title("After K-Means Clustering (from scikit-learn)", fontsize=20)
plt.plot(x, y, '.', color='#ffaaaa', markersize=45, alpha=0.6)

# Plot the centroids as blue X markers
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=200, linewidths=3, color='b', zorder=10)

plt.show()
Plotting the data before clustering looks like this:
In this example we used k=3. If we instead set k=5 to request five clusters, one of the three clusters remains the same, while the other two clusters each get split in two to obtain five clusters, as shown in the following diagram:
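Since the original cluster_input.csv file is not reproduced here, the effect of increasing k can be demonstrated on synthetic stand-in data with a similar shape (120 points around three centers; the centers and spread below are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for cluster_input.csv: 120 points drawn
# around three natural centers (assumed values for illustration)
rng = np.random.default_rng(42)
centers = np.array([[2.0, 2.0], [6.0, 9.0], [10.0, 3.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Asking for five clusters forces some of the three natural groups
# to split, yielding five centroids instead of three
kmeans5 = KMeans(init='k-means++', n_clusters=5, n_init=10)
kmeans5.fit(data)
print(kmeans5.cluster_centers_.shape)  # (5, 2)
```

Because k is fixed in advance, k-means always returns exactly the number of clusters you ask for, whether or not the data naturally contains that many groups; this is why choosing k carefully matters.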