K-means clustering with the iris data example

The famous iris dataset from the UCI machine learning repository is used here to illustrate k-means clustering. The data can be downloaded from http://archive.ics.uci.edu/ml/datasets/Iris. The iris data contains three types of flowers: setosa, versicolor, and virginica, along with their respective measurements of sepal length, sepal width, petal length, and petal width. Our task is to group the flowers based on their measurements. The code is as follows:

>>> import os 
""" First change the following directory link to where all input files do exist """ 
>>> os.chdir("D:\Book writing\Codes\Chapter 8") 
 
The k-means algorithm from scikit-learn is utilized in the following example:
 
# K-means clustering 
>>> import numpy as np 
>>> import pandas as pd 
>>> import matplotlib.pyplot as plt 
>>> from scipy.spatial.distance import cdist, pdist 
 
>>> from sklearn.cluster import KMeans 
>>> from sklearn.metrics import silhouette_score

>>> iris = pd.read_csv("iris.csv") 
>>> print (iris.head()) 

The following code separates out the class variable; it is kept aside as the dependent variable (used here only for coloring plots and validating results), while the unsupervised learning algorithm is applied to the x variables alone, without any target variable present:

>>> x_iris = iris.drop(['class'],axis=1) 
>>> y_iris = iris["class"] 

Here, three clusters have been chosen as a sample value, but in real life we do not know in advance how many clusters the data will fall into, so we need to test the results by trial and error. The maximum number of iterations chosen here is 300; this value can also be changed and the results checked accordingly:

>>> k_means_fit = KMeans(n_clusters=3,max_iter=300) 
>>> k_means_fit.fit(x_iris) 
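
K-means depends on the random initialization of centroids, so repeated runs can converge to different solutions. As a minimal sketch (not part of the original example; n_init and random_state are standard scikit-learn KMeans parameters), we can refit with a few different seeds and compare the inertia values, which should be close to each other if the solution is stable:

# Stability check across random seeds (illustrative sketch)
>>> for seed in [0, 1, 2]:
...     km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=seed)
...     km.fit(x_iris)
...     print ("Seed", seed, ", inertia: %0.3f" % km.inertia_)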
 
>>> print ("
K-Means Clustering - Confusion Matrix

",pd.crosstab(y_iris, k_means_fit.labels_,rownames = ["Actuall"],colnames = ["Predicted"]) )      
>>> print ("
Silhouette-score: %0.3f" % silhouette_score(x_iris, k_means_fit.labels_, metric='euclidean')) 

From the previous confusion matrix, we can see that all the setosa flowers are clustered correctly, whereas 2 out of 50 versicolor and 14 out of 50 virginica flowers are placed in the wrong clusters.
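
Note that k-means assigns arbitrary cluster numbers, so the columns of this matrix do not automatically correspond to the class names. When true labels happen to be available for validation, one simple check (an illustrative sketch, not part of the original code) is to map each cluster to its majority class and compute the agreement rate:

# Map each cluster to its majority class and compute agreement (sketch)
>>> ct = pd.crosstab(y_iris, k_means_fit.labels_)
>>> majority_class = ct.idxmax(axis=0)   # most frequent class per cluster
>>> mapped = pd.Series(k_means_fit.labels_).map(majority_class)
>>> print ("Agreement rate: %0.3f" % (mapped.values == y_iris.values).mean())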

Again, to reiterate, in real-life examples we do not have the category names in advance, so we cannot measure accuracy in this way; the clusters must be judged on internal metrics alone.

The following code performs a sensitivity analysis to check how many clusters actually provide a better explanation of the segments:

>>> for k in range(2,10): 
...     k_means_fitk = KMeans(n_clusters=k,max_iter=300) 
...     k_means_fitk.fit(x_iris) 
...     print ("For K value",k,",Silhouette-score: %0.3f" % silhouette_score(x_iris, k_means_fitk.labels_, metric='euclidean')) 

The silhouette coefficient values in the preceding results show that K = 2 and K = 3 have better scores than all the other values. As a rule of thumb, we take the next K value after the one with the highest silhouette coefficient; here, that suggests K = 3 is better. In addition, we also need to look at the average within-cluster variation and the elbow plot before concluding on the optimal K value.
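
For intuition, the silhouette score reported above is the average of a per-sample coefficient, (b - a) / max(a, b), where a is a point's mean distance to members of its own cluster and b is its mean distance to the nearest other cluster. The following sketch uses silhouette_samples from sklearn.metrics to look at the distribution behind that average (an addition for illustration, not part of the original code):

# Per-sample silhouette values behind the average score (sketch)
>>> from sklearn.metrics import silhouette_samples
>>> sil_vals = silhouette_samples(x_iris, k_means_fit.labels_, metric='euclidean')
>>> print ("Min: %0.3f, Mean: %0.3f, Max: %0.3f" % (sil_vals.min(), sil_vals.mean(), sil_vals.max()))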

# Avg. within-cluster sum of squares 
>>> K = range(1,10) 
 
>>> KM = [KMeans(n_clusters=k).fit(x_iris) for k in K]   # fit a model for each K 
>>> centroids = [k.cluster_centers_ for k in KM]         # cluster centers for each K 
 
>>> D_k = [cdist(x_iris, centrds, 'euclidean') for centrds in centroids]   # point-to-centroid distances 
 
>>> cIdx = [np.argmin(D,axis=1) for D in D_k]   # index of the closest centroid per point 
>>> dist = [np.min(D,axis=1) for D in D_k]      # distance to the closest centroid per point 
>>> avgWithinSS = [sum(d)/x_iris.shape[0] for d in dist]   # average within-cluster distance per K 
 
# Total within-cluster sum of squares 
>>> wcss = [sum(d**2) for d in dist]              # within-cluster sum of squares per K 
>>> tss = sum(pdist(x_iris)**2)/x_iris.shape[0]   # total sum of squares 
>>> bss = tss-wcss                                # between-cluster sum of squares 
 
# elbow curve - Avg. within-cluster sum of squares 
>>> fig = plt.figure() 
>>> ax = fig.add_subplot(111) 
>>> ax.plot(K, avgWithinSS, 'b*-') 
>>> plt.grid(True) 
>>> plt.xlabel('Number of clusters') 
>>> plt.ylabel('Average within-cluster sum of squares') 
>>> plt.show() 

From the elbow plot, it seems that the slope changes drastically at K = 3. Hence, we can select the optimal K value as three.
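
Since reading an elbow off a plot is somewhat subjective, a rough numerical companion (a sketch added for illustration; it will not always agree with a visual reading) is to print the percentage reduction in the average within-cluster measure for each additional cluster and look for where the improvement tapers off:

# Marginal improvement from each additional cluster (illustrative sketch)
>>> for i in range(1, len(avgWithinSS)):
...     drop = (avgWithinSS[i-1] - avgWithinSS[i]) / avgWithinSS[i-1] * 100
...     print ("K=%d -> K=%d: %0.1f%% reduction" % (K[i-1], K[i], drop))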

# elbow curve - percentage of variance explained 
>>> fig = plt.figure() 
>>> ax = fig.add_subplot(111) 
>>> ax.plot(K, bss/tss*100, 'b*-') 
>>> plt.grid(True) 
>>> plt.xlabel('Number of clusters') 
>>> plt.ylabel('Percentage of variance explained')
>>> plt.show()

Last but not least, the total percentage of variance explained should be greater than 80 percent when deciding the optimal number of clusters. Even here, a K value of three seems to give a decent value of total variance explained. Hence, we can conclude from all the preceding metrics (silhouette, average within-cluster variance, and total variance explained) that three clusters are ideal.
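
The 80 percent threshold can also be checked programmatically. The following sketch (an addition for illustration, reusing the bss and tss values computed above) prints the variance explained for each K and reports the first K crossing the threshold:

# First K crossing the 80% variance-explained threshold (sketch)
>>> var_explained = bss / tss * 100
>>> for k, v in zip(K, var_explained):
...     print ("K =", k, ", variance explained: %0.1f%%" % v)
>>> crossed = [k for k, v in zip(K, var_explained) if v >= 80]
>>> print ("First K above 80%:", crossed[0] if crossed else "none")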

The R code for k-means clustering using iris data is as follows:

setwd("D:\Book writing\Codes\Chapter 8")   
   
iris_data = read.csv("iris.csv")   
x_iris =   iris_data[,!names(iris_data) %in% c("class")]   
y_iris = iris_data$class   
   
km_fit = kmeans(x_iris,centers   = 3,iter.max = 300 )   
   
print(paste("K-Means   Clustering- Confusion matrix"))   
table(y_iris,km_fit$cluster)   
   
mat_avgss = matrix(nrow = 10,   ncol = 2)   
   
# Average within the cluster   sum of square   
print(paste("Avg. Within   sum of squares"))   
for (i in (1:10)){   
  km_fit =   kmeans(x_iris,centers = i,iter.max = 300 )   
  mean_km =   mean(km_fit$withinss)   
  print(paste("K-Value",i,",Avg.within   sum of squares",round(mean_km, 2)))   
  mat_avgss[i,1] = i   
  mat_avgss[i,2] = mean_km   
}   
   
plot(mat_avgss[,1],mat_avgss[,2],type   = 'o',xlab = "K_Value",ylab = "Avg. within sum of square")   
title("Avg. within sum of   squares vs. K-value")   
   
mat_varexp = matrix(nrow = 10,   ncol = 2)   
# Percentage of Variance   explained   
print(paste("Percent.   variance explained"))   
for (i in (1:10)){   
  km_fit =   kmeans(x_iris,centers = i,iter.max = 300 )   
  var_exp =   km_fit$betweenss/km_fit$totss   
  print(paste("K-Value",i,",Percent   var explained",round(var_exp,4)))   
  mat_varexp[i,1]=i   
  mat_varexp[i,2]=var_exp   
}   
   
plot(mat_varexp[,1],mat_varexp[,2],type   = 'o',xlab = "K_Value",ylab = "Percent Var explained")   
title("Avg. within sum of   squares vs. K-value")   