Using K-means for Predictive Analytics

K-means is a clustering algorithm that tries to group related data points together. Before applying it, however, we should understand its working principle and the mathematical operations behind it.

How K-means Works

Suppose we have n data points, x_i, i = 1...n, that need to be partitioned into k clusters. Since the goal is to assign a cluster to each data point, K-means aims to find the positions, μ_i, i = 1...k, of the cluster centers that minimize the distance from each data point to its cluster. Mathematically, the K-means algorithm tries to achieve this goal by solving the following optimization problem:

$$\underset{c}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2$$

In the previous equation, c_i is the set of data points assigned to cluster i, and $\lVert x - \mu_i \rVert$ is the Euclidean distance to be calculated (we will explain shortly why we should use this distance measure). Therefore, we can see that the overall clustering operation using K-means is not a trivial one, but an NP-hard optimization problem. This also means that the K-means algorithm does not necessarily find the global minimum and often gets stuck in different, locally optimal solutions.

Clustering using the K-means algorithm begins by initializing the centroid coordinates, for example, by picking k data points at random. With every pass of the algorithm, each point is assigned to its nearest centroid based on some distance metric, usually the Euclidean distance stated earlier.

Note

Distance calculation: There are other ways to calculate the distance as well. For example, the Chebyshev distance measures distance by considering only the most notable dimension (the largest coordinate difference). The Hamming distance identifies the number of positions at which two strings differ. The Mahalanobis distance normalizes distances with respect to the covariance matrix. The Manhattan distance measures distance by considering only axis-aligned directions. The Haversine distance measures the great-circle distance between two points on a sphere given their longitudes and latitudes.
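
For illustration, here is a quick comparison of a few of these metrics on toy inputs, assuming SciPy is installed (this snippet is not part of the lesson's code):

    from scipy.spatial import distance

    a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(distance.euclidean(a, b))   # straight-line distance
    print(distance.cityblock(a, b))   # Manhattan: sum of axis-aligned differences
    print(distance.chebyshev(a, b))   # largest single-dimension difference
    print(distance.hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # fraction of differing positions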

Considering these distance-measuring algorithms, it is clear that the Euclidean distance is the most appropriate choice for distance calculation in the K-means algorithm. The centroids are then updated to be the centers of all the points assigned to them in that iteration. This repeats until there is a minimal change in the centers. In short, K-means is an iterative algorithm that works in two steps:

  1. Cluster assignment step: K-means goes through each of the n data points in the dataset and assigns it to the cluster whose centroid is closest among the k centroids; that is, the least distant centroid is picked.
  2. Update step: For each cluster, a new centroid is calculated as the mean of all the data points assigned to that cluster. The overall workflow of K-means can be explained using a flowchart, as follows (a minimal NumPy sketch of these two steps is shown after the flowchart):

Figure 4: Flowchart of the K-means algorithm (the Elbow method is an optional but more advanced step)
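
To make these two steps concrete, the following is a minimal NumPy sketch of the same loop (the function name kmeans_numpy and its parameters are illustrative only; the TensorFlow implementation used later in this lesson is separate):

    import numpy as np

    def kmeans_numpy(x, k, n_iters=100, seed=0):
        rng = np.random.RandomState(seed)
        centroids = x[rng.choice(len(x), k, replace=False)]  # random initial centroids
        for _ in range(n_iters):
            # 1. Cluster assignment step: nearest centroid by Euclidean distance
            dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
            assignments = dists.argmin(axis=1)
            # 2. Update step: move each centroid to the mean of its assigned points
            new_centroids = np.array([x[assignments == c].mean(axis=0) for c in range(k)])
            if np.allclose(new_centroids, centroids):  # stop when the centroids settle
                break
            centroids = new_centroids
        return centroids, assignments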

Using K-means for Predicting Neighborhoods

Now, to show an example of clustering using K-means, we will use the Saratoga NY Homes dataset, downloaded from http://course1.winona.edu/bdeppa/Stat%20425/Datasets.html, as an unsupervised learning technique. The dataset contains several features of houses located in Saratoga County, New York; for example, price, lot size, waterfront, age, land value, new construction, central air, fuel type, heat type, sewer type, living area, Pct.College, bedrooms, fireplaces, bathrooms, and the number of rooms. However, only a few features are shown in Table 1:


Table 1: A sample data from the Saratoga NY Homes dataset

The target of this clustering technique is to perform an exploratory analysis based on the features of each house in order to find possible neighborhoods, that is, houses located in the same area. Before performing the feature extraction, we need to load and parse the Saratoga NY Homes dataset. We will walk through this example with step-by-step source code for better understanding:

  1. Loading required libraries and packages.

    We need the built-in Python modules os and random, as well as NumPy and Pandas for data manipulation; PCA from scikit-learn for dimensionality reduction; Matplotlib for plotting; and, of course, TensorFlow:

    import os
    import random
    from random import choice, shuffle
    import pandas as pd
    import numpy as np
    import tensorflow as tf
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d, Axes3D
  2. Loading, parsing, and preparing a training set.

    Here, the first line is used to ensure the reproducibility of the results. The second line reads the dataset from your location and converts it into a Pandas data frame, while the third line extracts the feature columns (everything except the first column) into a float32 NumPy array:

    random.seed(12345)
    train = pd.read_csv(os.path.join('input', 'saratoga.csv'))
    x_train = np.array(train.iloc[:, 1:], dtype='float32')

    If you now print the data frame (using print(train)), you should find that it contains the headers and data shown in figure 5:


    Figure 5: A snapshot of the Saratoga NY Homes dataset
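
    As a quick sanity check (illustrative only, not part of the original listing), you can also inspect the shape of the data frame and of the training matrix:

    print(train.shape)             # number of rows and columns in the data frame
    print(train.columns.tolist())  # column (feature) names
    print(x_train.shape)           # training matrix without the first column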

    Well, we have managed to prepare the dataset. Now, the next task is to conceptualize our K-means and write a function/class for it.

  3. Implementing K-means.

    The following is the source code of K-means, written in a simple way using TensorFlow's graph-and-session (1.x) API:

    def kmeans(x, n_features, n_clusters, n_max_steps=1000, early_stop=0.0):
        input_vec = tf.constant(x, dtype=tf.float32)
        # Initialize the centroids with n_clusters randomly chosen data points
        centroids = tf.Variable(tf.slice(tf.random_shuffle(input_vec), [0, 0], [n_clusters, -1]), dtype=tf.float32)
        old_centroids = tf.Variable(tf.zeros([n_clusters, n_features]), dtype=tf.float32)
        centroid_distance = tf.Variable(tf.zeros([n_clusters, n_features]))
        # Broadcast points against centroids to compute squared Euclidean distances
        expanded_vectors = tf.expand_dims(input_vec, 0)
        expanded_centroids = tf.expand_dims(centroids, 1)
        distances = tf.reduce_sum(tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)
        # Cluster assignment step: index of the nearest centroid for each point
        assignments = tf.argmin(distances, 0)
        # Update step: each new centroid is the mean of the points assigned to it
        means = tf.concat([tf.reduce_mean(
            tf.gather(input_vec, tf.reshape(tf.where(tf.equal(assignments, c)), [1, -1])),
            reduction_indices=[1]) for c in range(n_clusters)], 0)
        save_old_centroids = tf.assign(old_centroids, centroids)
        update_centroids = tf.assign(centroids, means)
        init_op = tf.global_variables_initializer()
        # How much the centroids moved in this iteration (used as the stopping criterion)
        performance = tf.assign(centroid_distance, tf.subtract(centroids, old_centroids))
        check_stop = tf.reduce_sum(tf.abs(performance))
        with tf.Session() as sess:
            sess.run(init_op)
            for step in range(n_max_steps):
                sess.run(save_old_centroids)
                _, centroid_values, assignment_values = sess.run(
                    [update_centroids, centroids, assignments])
                sess.run(check_stop)
                current_stop_coefficient = check_stop.eval()
                if current_stop_coefficient <= early_stop:
                    break
        return centroid_values, assignment_values

    The previous code contains all the steps required to develop the K-means model, including the distance-based centroid calculation, centroid update, and training parameters required.
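
    Before applying it to the housing data, you can sanity-check the function on a small synthetic array (an illustrative check, not part of the original listing):

    toy = np.random.rand(100, 2).astype('float32')
    toy_centers, toy_assignments = kmeans(toy, n_features=2, n_clusters=3)
    print(toy_centers.shape)     # expected: (3, 2)
    print(toy_assignments[:10])  # cluster index of the first ten points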

  4. Clustering the houses.

    Now the previous function can be invoked with real values, for example, our housing dataset. Since there are many houses, each with several features, it would be difficult to plot the clusters along with all the properties. This is why we apply Principal Component Analysis (PCA), which we discussed in the previous lessons, to reduce the data to three dimensions for plotting:

    centers, cluster_assignments = kmeans(x_train, len(x_train[0]), 10)
    pca_model = PCA(n_components=3)
    reduced_data = pca_model.fit_transform(x_train)
    reduced_centers = pca_model.transform(centers)
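
    Optionally, you can check how much of the original variance the three principal components retain (explained_variance_ratio_ is a standard scikit-learn attribute):

    print(pca_model.explained_variance_ratio_)        # variance captured by each component
    print(pca_model.explained_variance_ratio_.sum())  # total variance retained in three components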

    Well, now we are all set. It would be even better to visualize the clusters as shown in figure 6. For this, we will use mpl_toolkits.mplot3d for 3D projection, as follows:

    ax = plt.subplot(212, projection='3d')
    ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], c=cluster_assignments)
    ax.set_title("Clusters")
    plt.show()
    >>>

    Figure 6: Clustering the houses with similar properties, for example, price

    Here, we can see that most houses fall in the 0 to 100,000 range. The next largest group of houses falls in the range of 100,000 to 200,000. However, it is really difficult to separate them visually. Moreover, the number of predefined clusters that we used is 10, which might not be the optimal choice. Therefore, we need to tune this parameter.
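
    To see how the houses are distributed over the clusters numerically (a small illustrative check, not part of the original listing), you can count the assignments:

    unique_clusters, counts = np.unique(cluster_assignments, return_counts=True)
    for c, n in zip(unique_clusters, counts):
        print("cluster %d: %d houses" % (c, n))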

  5. Fine tuning and finding the optimal number of clusters.

    Choosing the right number of clusters often depends on the task. For example, suppose you're planning an event for hundreds of people, both young and old. If you have a budget for only two entertainment options, then you can use the K-means clustering with k = 2 to separate the guests into two age groups. Other times, it's not as obvious what the value of k should be. Automatically figuring out the value of k is a bit more complicated.

    As mentioned earlier, the K-means algorithm tries to minimize the Within-Cluster Sum of Squares (WCSS), that is, the sum of the squared Euclidean distances between each point and its assigned centroid.

    Note

    However, if you simply keep increasing the number of clusters to minimize this sum of squares, you would end up with a model where every data point is its own cluster center; in that case, this measure would be 0, but it would hardly be a generic enough model.

    Therefore, once you have trained your model by specifying the parameters, you can evaluate the result using WCSS. Technically, it is the same as the sum of the squared distances of each observation to the centroid of its cluster, summed over all K clusters. The beauty of clustering algorithms such as K-means is that they perform clustering on data with an unlimited number of features. They are a great tool to use when you have raw data and would like to know the patterns in that data.
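
    As a sketch, and assuming centers and cluster_assignments come from the earlier kmeans call, the WCSS can also be computed directly with NumPy:

    wcss_check = np.sum((x_train - centers[cluster_assignments]) ** 2)  # squared distance to assigned centroid
    print(wcss_check)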

    However, deciding the number of clusters prior to conducting the experiment might not be successful and can lead to an overfitting or underfitting problem. Also, informally, determining the number of clusters is a separate optimization problem to be solved. So, based on this, we can redesign our K-means function to also compute the WCSS value, as follows:

    def kmeans(x, n_features, n_clusters, n_max_steps=1000, early_stop=0.0):
        input_vec = tf.constant(x, dtype=tf.float32)
        centroids = tf.Variable(tf.slice(tf.random_shuffle(input_vec), [0, 0], [n_clusters, -1]), dtype=tf.float32)
        old_centroids = tf.Variable(tf.zeros([n_clusters, n_features]), dtype=tf.float32)
        centroid_distance = tf.Variable(tf.zeros([n_clusters, n_features]))
        expanded_vectors = tf.expand_dims(input_vec, 0)
        expanded_centroids = tf.expand_dims(centroids, 1)
        distances = tf.reduce_sum(tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)
        assignments = tf.argmin(distances, 0)
        means = tf.concat([tf.reduce_mean(
            tf.gather(input_vec, tf.reshape(tf.where(tf.equal(assignments, c)), [1, -1])),
            reduction_indices=[1]) for c in range(n_clusters)], 0)
        save_old_centroids = tf.assign(old_centroids, centroids)
        update_centroids = tf.assign(centroids, means)
        init_op = tf.global_variables_initializer()

        performance = tf.assign(centroid_distance, tf.subtract(centroids, old_centroids))
        check_stop = tf.reduce_sum(tf.abs(performance))
        # WCSS: sum of the squared distance of each point to its nearest centroid
        calc_wss = tf.reduce_sum(tf.reduce_min(distances, 0))
        with tf.Session() as sess:
            sess.run(init_op)
            for step in range(n_max_steps):
                sess.run(save_old_centroids)
                _, centroid_values, assignment_values = sess.run(
                    [update_centroids, centroids, assignments])
                sess.run(calc_wss)
                sess.run(check_stop)
                current_stop_coefficient = check_stop.eval()
                wss = calc_wss.eval()
                print(step, current_stop_coefficient)
                if current_stop_coefficient <= early_stop:
                    break
        return centroid_values, assignment_values, wss

    To fine-tune the clustering performance, we can use a heuristic approach called the Elbow method. We start from K = 2. Then, we run the K-means algorithm with increasing values of K and observe the value of the cost function (CF), that is, the WCSS. At some point, we should experience a big drop in the CF, after which the improvement becomes marginal as K increases further.

    In summary, we can pick the K right after the last big drop in WCSS as the optimal one. K-means also exposes various quantities, such as withinness and betweenness, by analyzing which you can assess the performance of K-means (a NumPy sketch of these quantities follows the list):

    • Betweenness: This is the between-cluster sum of squares, a measure of inter-cluster separation
    • Withinness: This is the within-cluster sum of squares, a measure of intra-cluster similarity (cohesion)
    • Tot.withinness: This is the sum of the withinness over all the clusters, that is, the total intra-cluster sum of squares
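
    These quantities can be sketched in NumPy as follows, assuming x_train, centers, and cluster_assignments from the earlier steps (the variable names here are illustrative):

    grand_mean = x_train.mean(axis=0)
    total_ss = np.sum((x_train - grand_mean) ** 2)                           # total sum of squares
    tot_withinness = np.sum((x_train - centers[cluster_assignments]) ** 2)   # total intra-cluster SS
    betweenness = total_ss - tot_withinness                                  # inter-cluster SS
    print(tot_withinness, betweenness)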

    Note that a robust and accurate clustering model will have a lower value of withinness and a higher value of betweenness. However, these values depend on the number of clusters, that is, K, which is chosen before building the model. Now, based on this, we will train the K-means model for different values of K, that is, different numbers of predefined clusters. We will go from K = 2 up to K = 9, as follows:

    wcss_list = []
    for i in range(2, 10):
        centers, cluster_assignments, wcss = kmeans(x_train, len(x_train[0]), i)
        wcss_list.append(wcss)

    Now, let's discuss how we can take advantage of the Elbow method for determining the number of clusters. We calculate the cost function, WCSS, as a function of the number of clusters for the K-means algorithm applied to the home data based on all the features, as follows:

    plt.figure(figsize=(12, 24))
    plt.subplot(211)
    plt.plot(range(2, 10), wcss_list)
    plt.xlabel('No of Clusters')
    plt.ylabel('WCSS')
    plt.title("WCSS vs Clusters")
    plt.show()
    >>>

    Figure 7: WCSS as a function of the number of clusters

    We will reuse this technique in upcoming K-means examples too. Now, it can be observed that a big drop occurs at k = 5. Therefore, we chose the number of clusters to be 5, as discussed for figure 7; basically, this is the value right after the last big drop. This means that the optimal number of clusters for our dataset, which we need to set before training the K-means model, is 5.
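
    With the optimal value identified, we can retrain the model with k = 5 before moving on to the cluster analysis (a one-line sketch reusing the kmeans function defined earlier):

    centers, cluster_assignments, wcss = kmeans(x_train, len(x_train[0]), 5)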

  6. Clustering analysis.

    From figure 8, it is clear that most houses fall into cluster 3 (4,655 houses), followed by cluster 4 (3,356 houses). The x-axis shows the price and the y-axis shows the lot size of each house. We can also observe that cluster 1 has only a few houses, potentially at longer distances from the others, and they are also expensive. So, it is most likely that you will not find a nearby neighborhood to interact with if you buy a house that falls in this cluster. However, if you like more human interaction and budget is a constraint, you should probably try buying a house from cluster 2, 3, 4, or 5:


    Figure 8: Clusters of neighborhoods, that is, homogeneous houses fall in the same clusters

    To make this analysis, we dumped the output into RStudio and generated the clusters shown in figure 8. The R script can be found in my GitHub repositories at https://github.com/rezacsedu/ScalaAndSparkForBigDataAnalytics. Alternatively, you can write your own script and do the visualization accordingly.
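
    For example, a minimal matplotlib sketch of such a visualization might look like the following, assuming that price and lot size are the first two columns of x_train (the column positions are an assumption about the dataset layout):

    plt.figure(figsize=(10, 6))
    plt.scatter(x_train[:, 0], x_train[:, 1], c=cluster_assignments, s=10)  # colored by cluster
    plt.xlabel('Price')
    plt.ylabel('Lot size')
    plt.title('Neighborhood clusters')
    plt.show()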
