By the end of this chapter, you will be able to load and visualize data and clusters with scatter plots; prepare data for cluster analysis; perform centroid clustering with k-means; interpret clustering results and determine the optimal number of clusters for a given dataset.
This chapter will introduce you to unsupervised learning tasks, where algorithms have to automatically learn patterns from data by themselves as no target variables are defined beforehand.
The previous chapters introduced you to very popular and extremely powerful machine learning algorithms. They all have one thing in common, which is that they belong to the same category of algorithms: supervised learning. This kind of algorithm tries to learn patterns based on a specified outcome column (target variable) such as sales, employee churn, or class of customer.
But what if you don't have such a variable in your dataset or you don't want to specify a target variable? Will you still be able to run some machine learning algorithms on it and find interesting patterns? The answer is yes, with the use of clustering algorithms that belong to the unsupervised learning category.
Clustering algorithms are very popular in the data science industry for grouping similar data points and detecting outliers. For instance, clustering algorithms can be used by banks for fraud detection by identifying unusual clusters from the data. They can also be used by e-commerce companies to identify groups of users with similar browsing behaviors, as in the following figures:
Clustering analysis performed on this data would uncover natural patterns by grouping similar data points such that you may get the following result:
The data is now segmented into three customer groups depending on their recurring visits and time spent on the website, and different marketing plans can then be used for each of these groups in order to maximize sales.
In this chapter, you will learn how to perform such analysis using a very famous clustering algorithm called k-means.
k-means is one of the most popular clustering algorithms (if not the most popular) among data scientists due to its simplicity and high performance. Its origins date back as early as 1956, when a famous mathematician named Hugo Steinhaus laid its foundations, but it was a decade later that another researcher called James MacQueen named this approach k-means.
The objective of k-means is to group similar data points (or observations) together that will form a cluster. Think of it as grouping elements close to each other (we will define how to measure closeness later in this chapter). For example, if you were manually analyzing user behavior on a mobile app, you might end up grouping customers who log in quite frequently, or users who make bigger in-app purchases, together. This is the kind of grouping that clustering algorithms such as k-means will automatically find for you from the data.
In this chapter, we will be working with an open source dataset shared publicly by the Australian Taxation Office (ATO). The dataset contains statistics about each postcode (also known as a zip code, which is an identification code used for sorting mail by area) in Australia during the financial year of 2014-15.
Note
The Australian Taxation Office (ATO) dataset can be found in the Packt GitHub repository here: https://packt.live/340xO5t.
The source of the dataset can be found here: https://packt.live/361i1p3.
We will perform cluster analysis on this dataset for two specific variables (or columns): Average net tax and Average total deductions. Our objective is to find groups (or clusters) of postcodes sharing similar patterns in terms of tax received and money deducted. Here is a scatter plot of these two variables:
As a data scientist, you need to analyze the graphs produced from this dataset and draw conclusions. Let's say you had to identify potential groupings of observations manually. One potential result could be as follows:
But rather than having you guess these groupings manually, it would be better to use an algorithm to find them for us. This is exactly what we will do in the following exercise, where we'll perform clustering analysis on this dataset.
In this exercise, we will apply k-means clustering to the ATO dataset, observe the clusters it divides the data into, and conclude by analyzing the output:
We will begin by importing the required packages using Python's import statement:
Note
You can create short aliases for the packages you will be calling often in your script, as shown in the following code snippet.
import pandas as pd
from sklearn.cluster import KMeans
Note
We will look at KMeans (from sklearn.cluster), which is imported in the code here, in more detail later in the chapter.
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
In the next step, we will use the pandas package to load our data into a DataFrame (think of it as a table, like on an Excel spreadsheet, with a row index and column names).
Our input file is in CSV format, and pandas has a method that can directly read this format, which is .read_csv().
df = pd.read_csv(file_url, usecols=['Postcode', 'Average net tax', 'Average total deductions'])
Now we have loaded the data into a pandas DataFrame.
df.head()
You should get the following output:
df.tail()
You should get the following output:
Now that we have our data, let's jump straight to what we want to do: find clusters.
As you saw in the previous chapters, sklearn provides the exact same API for training different machine learning algorithms: you instantiate a model, fit it to the data with fit(), and generate predictions with predict().
Note
Here, we will use all the default values for the k-means hyperparameters except for the random_state one. Specifying a fixed random state (also called a seed) will help us to get reproducible results every time we have to rerun our code.
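The role of a fixed seed is not specific to k-means: any pseudo-random procedure becomes reproducible once its seed is fixed. A minimal illustration with Python's built-in random module:

```python
import random

# Fixing a seed makes any pseudo-random procedure reproducible:
# the same seed always yields the same sequence of "random" numbers.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True: same seed, same sequence
```

The random_state parameter in sklearn plays exactly this role for k-means' random centroid initialization.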
kmeans = KMeans(random_state=42)
X = df[['Average net tax', 'Average total deductions']]
kmeans.fit(X)
You should get the following output:
We just ran our first clustering algorithm in only a few lines of code.
y_preds = kmeans.predict(X)
y_preds
You should get the following output:
df['cluster'] = y_preds
df.head()
Note
The predictions from the sklearn predict() method are in the exact same order as the input data. So, the first prediction will correspond to the first row of your DataFrame.
You should get the following output:
Our k-means model has grouped the first two rows into the same cluster, 1. We can see these two observations both have average net tax values around 28,000. The last three data points have been assigned to different clusters (2, 3, and 7, respectively) and we can see their values for both average total deductions and average net tax are very different from each other. It seems that lower values are grouped into cluster 2 while higher values are classified into cluster 3. We are starting to build our understanding of how k-means has decided to group the observations from this dataset.
This is a great start. You have learned how to train (or fit) a k-means model in a few lines of code. Now we can start diving deeper into the magic behind k-means.
After training our k-means algorithm, we will likely be interested in analyzing its results in more detail. Remember, the objective of cluster analysis is to group observations with similar patterns together. But how can we see whether the groupings found by the algorithm are meaningful? We will be looking at this in this section by using the dataset results we just generated.
One way of investigating this is to analyze the dataset row by row with the assigned cluster for each observation. This can be quite tedious, especially if the size of your dataset is quite big, so it would be better to have a kind of summary of the cluster results.
If you are familiar with Excel spreadsheets, you are probably thinking about using a pivot table to get the average of the variables for each cluster. In SQL, you would have probably used a GROUP BY statement. If you are not familiar with either of these, you may think of grouping each cluster together and then calculating the average for each of them. The good news is that this can be easily achieved with the pandas package in Python. Let's see how this can be done with an example.
To create a pivot table similar to an Excel one, we will be using the pivot_table() method from pandas. We need to specify the following parameters for this method:
import numpy as np
df.pivot_table(values=['Average net tax', 'Average total deductions'], index='cluster', aggfunc=np.mean)
Note
We will be using the numpy implementation of mean() as it is more optimized for pandas DataFrames.
In this summary, we can see that the algorithm has grouped the data into eight clusters (clusters 0 to 7). Cluster 0 has the lowest average net tax and total deductions amounts among all the clusters, while cluster 4 has the highest values. With this pivot table, we are able to compare the clusters with each other using their summarized values.
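If you prefer the GROUP BY style mentioned earlier, the same per-cluster summary can be obtained with pandas' groupby() method. A quick sketch on a small hypothetical DataFrame (not the ATO data):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the clustered ATO data
df = pd.DataFrame({
    'Average net tax': [100, 200, 110, 210],
    'Average total deductions': [10, 20, 12, 22],
    'cluster': [0, 1, 0, 1],
})

# groupby + mean produces the same per-cluster averages as pivot_table
summary = df.groupby('cluster')[['Average net tax',
                                 'Average total deductions']].mean()
print(summary)
```

Both approaches yield one row per cluster with the mean of each selected column.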
Using an aggregated view of clusters is a good way of seeing the difference between them, but it is not the only way. Another possibility is to visualize clusters in a graph. This is exactly what we are going to do now.
You may have heard of different visualization packages, such as matplotlib, seaborn, and bokeh, but in this chapter, we will be using the altair package because it is quite simple to use (its API is very similar to sklearn). Let's import it first:
import altair as alt
Then, we will instantiate a Chart() object with our DataFrame and save it into a variable called chart:
chart = alt.Chart(df)
Now we will specify the type of graph we want, a scatter plot, with the .mark_circle() method and will save it into a new variable called scatter_plot:
scatter_plot = chart.mark_circle()
Finally, we need to configure our scatter plot by specifying the names of the columns that will be our x- and y-axes on the graph. We also tell the scatter plot to color each point according to its cluster value with the color option:
scatter_plot.encode(x='Average net tax', y='Average total deductions', color='cluster:N')
Note
You may have noticed that we added :N at the end of the cluster column name. This extra parameter is used in altair to specify the type of value for this column. :N means the information contained in this column is categorical. altair automatically defines the color scheme to be used depending on the type of a column.
You should get the following output:
We can now easily see what the clusters in this graph are and how they differ from each other. We can clearly see that k-means assigned data points to each cluster mainly based on the x-axis variable, which is Average net tax. The boundaries of the clusters are vertical straight lines. For instance, the boundary separating the red and purple clusters is roughly around 18,000. Observations below this limit are assigned to the red cluster (2) and those above to the purple cluster (6).
If you have used visualization tools such as Tableau or Power BI, you might feel a bit frustrated, as this graph is static: you can't hover over each data point to get more information and find, for instance, the limit separating the orange cluster from the pink one. But this can easily be achieved with altair, and it is one of the reasons we chose to use it. We can add interactions to our chart with minimal code changes.
Let's say we want to add a tooltip that will display the values for the two columns of interest: the postcode and the assigned cluster. With altair, we just need to add a parameter called tooltip in the encode() method with a list of corresponding column names and call the interactive() method just after, as seen in the following code snippet:
scatter_plot.encode(x='Average net tax', y='Average total deductions',color='cluster:N', tooltip=['Postcode', 'cluster', 'Average net tax', 'Average total deductions']).interactive()
You should get the following output:
Now we can easily look at the data points near the cluster boundaries and find that the threshold used to differentiate the orange cluster (1) from the pink one (7) is close to 32,000 in Average net tax. We can also see that postcode 3146 sits near this boundary: its average net tax is precisely 31,992 and its average total deductions are 4,276.
In this exercise, we will learn how to perform clustering analysis with k-means and visualize its results based on postcode values sorted by business income and expenses. The following steps will help you complete this exercise:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
import numpy as np
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
df.tail(10)
You should get the following output:
X = df[['Average total business income', 'Average total business expenses']]
kmeans = KMeans(random_state=8)
kmeans.fit(X)
You should get the following output:
y_preds = kmeans.predict(X)
y_preds[-10:]
You should get the following output:
df['cluster'] = y_preds
df.tail(10)
You should get the following output:
df.pivot_table(values=['Average total business income', 'Average total business expenses'], index='cluster', aggfunc=np.mean)
You should get the following output:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses', color='cluster:N', tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']).interactive()
You should get the following output:
We can see that k-means has grouped the observations into eight different clusters based on the values of the two variables ('Average total business income' and 'Average total business expenses'). For instance, all the low-value data points have been assigned to cluster 7, while the ones with extremely high values belong to cluster 2. So, k-means has grouped together the data points that share similar behaviors.
You just successfully completed a cluster analysis and visualized its results. You learned how to load a real-world dataset, fit k-means, and display a scatter plot. This is a great start, and we will be delving into more details on how to improve the performance of your model in the sections to come in this chapter.
In the previous sections, we saw how easy it is to fit the k-means algorithm on a given dataset. In our ATO dataset, we found 8 different clusters that were mainly defined by the values of the Average net tax variable.
But you may have asked yourself: "Why 8 clusters? Why not 3 or 15 clusters?" These are indeed excellent questions. The short answer is that we used the default value of k-means' n_clusters hyperparameter, which defines the number of clusters to be found, and that default is 8.
As you will recall from Chapter 2, Regression, and Chapter 4, Multiclass Classification (NLP), the value of a hyperparameter isn't learned by the algorithm but has to be set by you prior to training. For k-means, n_clusters is one of the most important hyperparameters you will have to tune. Choosing a low value will lead k-means to group many data points together, even though they are very different from each other. On the other hand, choosing a high value may force the algorithm to split similar observations into multiple clusters.
Looking at the scatter plot from the ATO dataset, eight clusters seems like a lot. On the graph, some of the clusters look very close to each other and have similar values. Intuitively, just by looking at the plot, you could have said that there were between two and four different clusters. As you can see, this is quite subjective, and it would be great if there was a method that could help us define the right number of clusters for a dataset. Such a method does indeed exist, and it is called the Elbow method.
This method assesses the compactness of clusters, the objective being to minimize a value known as inertia. More details and an explanation about this will be provided later in this chapter. For now, think of inertia as a value that says, for a group of data points, how far from each other or how close to each other they are.
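To make the idea concrete, here is a hand computation of inertia on a tiny hypothetical 1-D dataset: inertia is simply the sum of squared distances from each point to the centroid of its assigned cluster.

```python
# Hypothetical 1-D data split into two clusters
points = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]        # cluster assignment of each point
centroids = [1.5, 10.5]      # mean of each cluster

# Inertia: sum of squared distances to the assigned centroid
inertia = sum((p - centroids[c]) ** 2 for p, c in zip(points, labels))
print(inertia)  # 1.0 -> tightly packed clusters give a small inertia
```

This is the same quantity that sklearn exposes as the inertia_ attribute after fitting.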
Let's apply this method to our ATO dataset. First, we will define the range of cluster numbers we want to evaluate (from 1 to 9; note that range(1, 10) excludes its upper bound) and save them in a DataFrame called clusters. We will also create an empty list called inertia, where we will store our calculated values:
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1, 10)
inertia = []
Next, we will create a for loop that will iterate over the range, fit a k-means model with the specified number of clusters, extract the inertia value, and store it in our list, as in the following code snippet:
for k in clusters['cluster_range']:
kmeans = KMeans(n_clusters=k, random_state=8).fit(X)
inertia.append(kmeans.inertia_)
Now we can use our list of inertia values in the clusters DataFrame:
clusters['inertia'] = inertia
clusters
You should get the following output:
Then, we need to plot a line chart using altair with the mark_line() method. We will specify the 'cluster_range' column as our x-axis and 'inertia' as our y-axis, as in the following code snippet:
alt.Chart(clusters).mark_line().encode(x='cluster_range', y='inertia')
You should get the following output:
Note
You don't have to save each altair object in a separate variable; you can simply chain the methods one after the other with ".".
Now that we have plotted the inertia value against the number of clusters, we need to find the optimal number of clusters. What we need to do is find the inflection point in the graph, where the inertia value starts to decrease more slowly (that is, where the slope of the line flattens noticeably). Finding the right inflection point can be a bit tricky: if you picture this line chart as an arm, what we want is the center of the elbow (now you know where the name of this method comes from). So, looking at our example, we would say that the optimal number of clusters is three. If we kept adding more clusters, the inertia would not decrease drastically, so they would not add any value. This is why we pick the middle of the elbow as the inflection point.
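Eyeballing the elbow can also be approximated programmatically with a rough heuristic: pick the cluster count where the decrease in inertia slows down the most (the largest second difference). A sketch on hypothetical inertia values (not the ones from our dataset):

```python
# Hypothetical inertia values for k = 1..6
cluster_range = list(range(1, 7))
inertias = [1000.0, 500.0, 150.0, 130.0, 120.0, 115.0]

# First differences: how much inertia drops with each extra cluster
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
# Second differences: how sharply the drop slows down
slowdowns = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]

# The elbow sits where the slowdown is largest
elbow = cluster_range[slowdowns.index(max(slowdowns)) + 1]
print(elbow)  # 3
```

This heuristic is only a guide; visual inspection of the curve remains the usual practice.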
Now let's retrain our k-means model with this hyperparameter and plot the clusters, as shown in the following code snippet:
kmeans = KMeans(random_state=42, n_clusters=3)
kmeans.fit(X)
df['cluster2'] = kmeans.predict(X)
scatter_plot.encode(x='Average net tax', y='Average total deductions',color='cluster2:N',
tooltip=['Postcode', 'cluster', 'Average net tax', 'Average total deductions']
).interactive()
You should get the following output:
This is very different compared to our initial results. Looking at the three clusters, we can see that:
Note
It is worth noticing that the data points are more spread in the third cluster; this may indicate that there are some outliers in this group.
This example showed us how important it is to define the right number of clusters before training a k-means algorithm if we want to get meaningful groups from data. We used a method called the Elbow method to find this optimal number.
In this exercise, we will apply the Elbow method to the same data as in Exercise 5.02, Clustering Australian Postcodes by Business Income and Expenses, to find the optimal number of clusters, before fitting a k-means model:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
Next, we will load the dataset and select the same columns as in Exercise 5.02, Clustering Australian Postcodes by Business Income and Expenses, and print the first five rows.
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
df.head()
You should get the following output:
X = df[['Average total business income', 'Average total business expenses']]
clusters = pd.DataFrame()
inertia = []
Now, use the range function to generate the list of cluster numbers to evaluate, from 1 to 14 (range(1, 15) excludes its upper bound), and assign it to a new column called 'cluster_range' in the clusters DataFrame:
clusters['cluster_range'] = range(1, 15)
for k in clusters['cluster_range']:
kmeans = KMeans(n_clusters=k).fit(X)
inertia.append(kmeans.inertia_)
clusters['inertia'] = inertia
clusters
You should get the following output:
alt.Chart(clusters).mark_line().encode(alt.X('cluster_range'), alt.Y('inertia'))
You should get the following output:
optim_cluster = 4
kmeans = KMeans(random_state=42, n_clusters=optim_cluster)
kmeans.fit(X)
df['cluster2'] = kmeans.predict(X)
df.head()
You should get the following output:
alt.Chart(df).mark_circle().encode(x='Average total business income', y='Average total business expenses',color='cluster2:N', tooltip=['Postcode', 'cluster2', 'Average total business income', 'Average total business expenses']).interactive()
You should get the following output:
You just learned how to find the optimal number of clusters before fitting a k-means model. The data points are grouped into four different clusters in our output here:
The results from Exercise 5.02, Clustering Australian Postcodes by Business Income and Expenses, have eight different clusters, and some of them are very similar to each other. Here, you saw that having the optimal number of clusters provides better differentiation between the groups, and this is why it is one of the most important hyperparameters to be tuned for k-means. In the next section, we will look at two other important hyperparameters for initializing k-means.
Since the beginning of this chapter, we've been referring to k-means every time we've fitted our clustering algorithms. But you may have noticed in each model summary that there was a hyperparameter called init with the default value as k-means++. We were, in fact, using k-means++ all this time.
The difference between k-means and k-means++ is in how they initialize clusters at the start of the training. k-means randomly chooses the center of each cluster (called the centroid) and then assigns each data point to its nearest cluster. If this cluster initialization is chosen incorrectly, this may lead to non-optimal grouping at the end of the training process. For example, in the following graph, we can clearly see the three natural groupings of the data, but the algorithm didn't succeed in identifying them properly:
k-means++ is an attempt to find better centroids at initialization time. The idea behind it is to choose the first centroid randomly and then pick each subsequent one from the remaining data points, with a probability proportional to its squared distance from the nearest centroid already chosen, so points that are further away are favored. Even though k-means++ tends to get better results than the original k-means, in some cases it can still lead to non-optimal clustering.
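The seeding scheme just described can be sketched in a few lines. This is an illustrative, pure-Python version of the distance-weighted sampling on hypothetical 1-D data, not sklearn's implementation:

```python
import random

random.seed(0)
points = [1.0, 1.2, 0.9, 8.0, 8.3, 15.0, 15.2]  # three natural 1-D groups
k = 3

centroids = [random.choice(points)]  # first centroid: chosen uniformly
while len(centroids) < k:
    # squared distance from each point to its nearest existing centroid
    d2 = [min((p - c) ** 2 for c in centroids) for p in points]
    # far-away points are more likely to become the next centroid
    centroids.append(random.choices(points, weights=d2, k=1)[0])

print(centroids)
```

Because distant points carry more weight, the initial centroids tend to land in different natural groups, which is what makes this scheme more robust than purely random initialization.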
Another hyperparameter data scientists can use to lower the risk of incorrect clusters is n_init. This corresponds to the number of times k-means is run with different initializations, the final model being the best run. So, if you have a high number for this hyperparameter, you will have a higher chance of finding the optimal clusters, but the downside is that the training time will be longer. So, you have to choose this value carefully, especially if you have a large dataset.
Let's try this out on our ATO dataset by having a look at the following example.
First, let's run only one iteration using random initialization:
kmeans = KMeans(random_state=14, n_clusters=3, init='random', n_init=1)
kmeans.fit(X)
As usual, we want to visualize our clusters with a scatter plot, as defined in the following code snippet:
df['cluster3'] = kmeans.predict(X)
alt.Chart(df).mark_circle().encode(x='Average net tax', y='Average total deductions',color='cluster3:N',
tooltip=['Postcode', 'cluster', 'Average net tax', 'Average total deductions']
).interactive()
You should get the following output:
Overall, the result is very close to that of our previous run. It is worth noticing that the orange and red clusters have swapped positions due to the different initializations. Also, their boundaries are slightly different: now the value on the plot is around 31,000 while previously the threshold was around 34,000.
Now let's try with five iterations (using the n_init hyperparameter) and k-means++ initialization (using the init hyperparameter):
kmeans = KMeans(random_state=14, n_clusters=3, init='k-means++', n_init=5)
kmeans.fit(X)
df['cluster4'] = kmeans.predict(X)
alt.Chart(df).mark_circle().encode(x='Average net tax', y='Average total deductions',color='cluster4:N',
tooltip=['Postcode', 'cluster', 'Average net tax', 'Average total deductions']
).interactive()
You should get the following output:
Here, the results are very close to the original run with 10 iterations. This means that we didn't have to run so many iterations for k-means to converge and could have saved some time with a lower number.
In this exercise, we will use the same data as in Exercise 5.02, Clustering Australian Postcodes by Business Income and Expenses, and try different values for the init and n_init hyperparameters and see how they affect the final clustering result:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
X = df[['Average total business income', 'Average total business expenses']]
kmeans = KMeans(random_state=1, n_clusters=4, init='random', n_init=1)
kmeans.fit(X)
df['cluster3'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster3:N',
tooltip=['Postcode', 'cluster3', 'Average total business income', 'Average total business expenses']
).interactive()
You should get the following output:
kmeans = KMeans(random_state=1, n_clusters=4, init='random', n_init=10)
kmeans.fit(X)
df['cluster4'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster4:N',
tooltip=['Postcode', 'cluster4', 'Average total business income', 'Average total business expenses']
).interactive()
You should get the following output:
kmeans = KMeans(random_state=1, n_clusters=4, init='random', n_init=100)
kmeans.fit(X)
df['cluster5'] = kmeans.predict(X)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='Average total business income', y='Average total business expenses',color='cluster5:N',
tooltip=['Postcode', 'cluster5', 'Average total business income', 'Average total business expenses']
).interactive()
You should get the following output:
You just learned how to tune the two main hyperparameters responsible for initializing k-means clusters. You have seen in this exercise that increasing the number of iterations with n_init didn't have much impact on the clustering result for this dataset.
In this case, it is better to use a lower value for this hyperparameter as it will speed up the training time. But for a different dataset, you may face a case where the results differ drastically depending on the n_init value. In such a case, you will have to find a value of n_init that is not too small but also not too big. You want to find the sweet spot where the results do not change much compared to the last result obtained with a different value.
We've talked a lot about similarities between data points in the previous sections, but we haven't really defined what this means. You have probably guessed that it has something to do with how close or how far observations are from each other. You are heading in the right direction. It has to do with some sort of distance measure between two points. The one used by k-means is called squared Euclidean distance and its formula is:
If you don't have a statistical background, this formula may look intimidating, but it is actually very simple: it is the sum of the squared differences between the coordinates of the data points. Here, x and y are two data points, and the index i runs over their coordinates (dimensions). If the data has two dimensions, the sum has two terms; if there are three dimensions, it has three.
Let's apply this formula to the ATO dataset.
First, we will grab the values needed – that is, the coordinates from the first two observations – and print them:
Note
In pandas, the iloc method is used to subset the rows or columns of a DataFrame by index. For instance, if we wanted to grab row number 888 and column number 6, we would use the following syntax: dataframe.iloc[888, 6].
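A quick illustration of iloc on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})

# df.iloc[row, column] subsets by integer position (both 0-indexed)
print(df.iloc[1, 0])      # value at row 1, column 0 -> 20
print(df.iloc[0].values)  # all values of row 0 as an array
```

Note that iloc works on positions, unlike loc, which works on labels.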
x = X.iloc[0,].values
y = X.iloc[1,].values
print(x)
print(y)
You should get the following output:
The coordinates for x are (27555, 2071) and the coordinates for y are (28142, 3804). Here, the formula is telling us to calculate the squared difference between each axis of the two data points and sum them:
squared_euclidean = (x[0] - y[0])**2 + (x[1] - y[1])**2
print(squared_euclidean)
You should get the following output:
3347858
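The same computation generalizes to any number of dimensions with a small helper function (a pure-Python sketch):

```python
def squared_euclidean(x, y):
    """Sum of squared coordinate differences between two points."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

x = (27555, 2071)
y = (28142, 3804)
print(squared_euclidean(x, y))  # 3347858, matching the manual calculation
```

The helper works unchanged for 3-D points or higher, since zip pairs up however many coordinates the points have.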
k-means uses this metric to calculate the distance between each data point and the center of its assigned cluster (also called the centroid). Here is the basic logic behind this algorithm:
1. Choose the initial centroids (randomly or with k-means++).
2. Assign each data point to its nearest centroid using the squared Euclidean distance.
3. Recompute each centroid as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
That's it. The k-means algorithm is as simple as that. We can extract the centroids after fitting a k-means model with the cluster_centers_ attribute.
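To see this logic end to end, here is a from-scratch sketch of the k-means loop on tiny hypothetical 1-D data (illustrative only, not sklearn's implementation): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes.

```python
points = [1.0, 2.0, 10.0, 11.0]
centroids = [0.0, 5.0]  # deliberately poor starting centroids

for _ in range(10):
    # Assignment step: nearest centroid by squared Euclidean distance
    labels = [min(range(len(centroids)),
                  key=lambda c: (p - centroids[c]) ** 2)
              for p in points]
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, l in zip(points, labels) if l == c]
        new_centroids.append(sum(members) / len(members)
                             if members else centroids[c])
    if new_centroids == centroids:  # converged
        break
    centroids = new_centroids

print(centroids)  # [1.5, 10.5]
```

Even from poor starting positions, the centroids end up at the means of the two natural groups, which is exactly what the assignment/update loop is designed to do.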
Let's see how we can plot the centroids in an example.
First, we fit a k-means model as shown in the following code snippet:
kmeans = KMeans(random_state=42, n_clusters=3, init='k-means++', n_init=5)
kmeans.fit(X)
df['cluster6'] = kmeans.predict(X)
Now extract the centroids into a DataFrame and print them:
centroids = kmeans.cluster_centers_
centroids = pd.DataFrame(centroids, columns=['Average net tax', 'Average total deductions'])
print(centroids)
You should get the following output:
We will plot the usual scatter plot but will assign it to a variable called chart1:
chart1 = alt.Chart(df).mark_circle().encode(x='Average net tax', y='Average total deductions',color='cluster6:N',
tooltip=['Postcode', 'cluster6', 'Average net tax', 'Average total deductions']
).interactive()
chart1
You should get the following output:
Now, we will create a second scatter plot, called chart2, containing only the centroids:
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average net tax', y='Average total deductions', color=alt.value('black'),
tooltip=['Average net tax', 'Average total deductions']).interactive()
chart2
You should get the following output:
And now we combine the two charts, which is extremely easy with altair:
chart1 + chart2
You should get the following output:
Now we can easily see which centroids the observations are closest to.
In this exercise, we will be coding the first iteration of k-means in order to assign data points to their closest cluster centroids. The following steps will help you complete the exercise:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
X = df[['Average total business income', 'Average total business expenses']]
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()
business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()
print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)
You should get the following output:
0
876324
0
884659
import random
random.seed(42)
centroids = pd.DataFrame()
centroids['Average total business income'] = random.sample(range(business_income_min, business_income_max), 4)
centroids['Average total business expenses'] = random.sample(range(business_expenses_min, business_expenses_max), 4)
centroids['cluster'] = centroids.index
centroids
You should get the following output:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses',
color=alt.value('orange'),
tooltip=['Postcode', 'Average total business income', 'Average total business expenses']
).interactive()
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses',
color=alt.value('black'),
tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()
chart1 + chart2
You should get the following output:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']
distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
distances
You should get the following output:
[215601466600, 10063365460, 34245932020, 326873037866]
cluster_index = distances.index(min(distances))
df.at[0, 'cluster'] = cluster_index
df.head()
You should get the following output:
distances = [squared_euclidean(df.at[1, 'Average total business income'], df.at[1, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
df.at[1, 'cluster'] = distances.index(min(distances))
distances = [squared_euclidean(df.at[2, 'Average total business income'], df.at[2, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
df.at[2, 'cluster'] = distances.index(min(distances))
distances = [squared_euclidean(df.at[3, 'Average total business income'], df.at[3, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
df.at[3, 'cluster'] = distances.index(min(distances))
distances = [squared_euclidean(df.at[4, 'Average total business income'], df.at[4, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
df.at[4, 'cluster'] = distances.index(min(distances))
df.head()
You should get the following output:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses',
color='cluster:N',
tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']
).interactive()
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses',
color=alt.value('black'),
tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()
chart1 + chart2
You should get the following output:
In this final result, we can see where the four centroids have been placed in the graph and which cluster each of the five data points has been assigned to:
You just re-implemented a big part of the k-means algorithm from scratch. You went through how to randomly initialize centroids (cluster centers), calculate the squared Euclidean distance for some data points, find their closest centroid, and assign them to the corresponding cluster. This wasn't easy, but you made it.
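The five near-identical assignment blocks above can also be condensed into a single loop. Here is a sketch of that refactoring, using small made-up numbers in place of the taxstats data:

```python
# Loop version of the repetitive per-row assignments: for each of the first
# five rows, compute the squared Euclidean distance to every centroid and
# record the index of the closest one. Synthetic values stand in for the data.
import pandas as pd

def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

df = pd.DataFrame({'Average total business income': [100, 40000, 250000, 9000, 620000],
                   'Average total business expenses': [80, 35000, 200000, 7000, 600000]})
centroids = pd.DataFrame({'Average total business income': [5000, 50000, 300000, 700000],
                          'Average total business expenses': [4000, 45000, 250000, 650000]})

for row in range(5):
    distances = [squared_euclidean(df.at[row, 'Average total business income'],
                                   df.at[row, 'Average total business expenses'],
                                   centroids.at[i, 'Average total business income'],
                                   centroids.at[i, 'Average total business expenses'])
                 for i in range(4)]
    df.at[row, 'cluster'] = distances.index(min(distances))

print(df['cluster'].tolist())
```

A full k-means implementation would then recompute each centroid as the mean of its assigned points and repeat until the assignments stop changing.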
You've already learned a lot about the k-means algorithm, and we are close to the end of this chapter. In this final section, we will not talk about another hyperparameter (you've already been through the main ones) but a very important topic: data processing.
Fitting a k-means algorithm is extremely easy. The trickiest part is making sure the resulting clusters are meaningful for your project, and we have seen how we can tune some hyperparameters to ensure this. But handling input data is as important as all the steps you have learned about so far. If your dataset is not well prepared, even if you find the best hyperparameters, you will still get some bad results.
Let's have another look at our ATO dataset. In the previous section, Calculating the Distance to the Centroid, we found three different clusters, and they were mainly defined by the 'Average net tax' variable. It was as if k-means didn't take into account the second variable, 'Average total deductions', at all. This is in fact due to these two variables having very different ranges of values and the way that squared Euclidean distance is calculated.
Squared Euclidean distance gives more weight to variables with larger values. Let's take an example to illustrate this point with two data points called A and B with respective x and y coordinates of (1, 50000) and (100, 100000). The squared Euclidean distance between A and B will be (100000 - 50000)^2 + (100 - 1)^2. We can clearly see that the result is mainly driven by the difference between 100,000 and 50,000: 50,000^2. The difference between 100 and 1 (99^2) accounts for very little in the final result.
But if you look at the ratio between 100,000 and 50,000, it is a factor of 2 (100,000 / 50,000 = 2), while the ratio between 100 and 1 is a factor of 100 (100 / 1 = 100). Does it make sense for the higher-value variable to "dominate" the clustering result? It really depends on your project, and this situation may be intended. But if you want things to be fair between the different axes, it's preferable to bring them all into a similar range of values before fitting a k-means model. This is the reason why you should always consider standardizing your data before running your k-means algorithm.
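A quick numeric check of the A and B example above makes the imbalance concrete:

```python
# Points A and B from the example in the text: the y-axis difference (50,000)
# completely dominates the squared Euclidean distance.
A = (1, 50000)
B = (100, 100000)

d_x = (B[0] - A[0]) ** 2   # 99**2 = 9,801
d_y = (B[1] - A[1]) ** 2   # 50,000**2 = 2,500,000,000
print(d_y / (d_x + d_y))   # y contributes more than 99.99% of the total
```

After standardization, both axes would span a comparable range, so neither difference would dominate the distance by construction.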
There are multiple ways to standardize data, and we will have a look at the two most popular ones: min-max scaling and z-score. Luckily for us, the sklearn package has an implementation for both methods.
The formula for min-max scaling is very simple: on each axis, you need to remove the minimum value for each data point and divide the result by the difference between the maximum and minimum values. The scaled data will have values ranging between 0 and 1:
Let's look at min-max scaling with sklearn in the following example.
First, we import the relevant class and instantiate an object:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
Then, we fit it to our dataset:
min_max_scaler.fit(X)
You should get the following output:
And finally, call the transform() method to standardize the data:
X_min_max = min_max_scaler.transform(X)
X_min_max
You should get the following output:
Now we print the minimum and maximum values of the min-max-scaled data for both axes:
X_min_max[:,0].min(), X_min_max[:,0].max(), X_min_max[:,1].min(), X_min_max[:,1].max()
You should get the following output:
We can see that both axes now have their values sitting between 0 and 1.
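As a sanity check, the same transformation can be computed by hand, column by column, as (x - min) / (max - min), and compared to MinMaxScaler on a small made-up array:

```python
# Min-max scaling by hand versus sklearn's MinMaxScaler on synthetic data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 1000.0]])

manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))   # both methods agree
print(manual.min(), manual.max())    # all values sit in [0, 1]
```

Note that min-max scaling is sensitive to outliers: a single extreme value stretches the denominator and compresses every other data point toward 0.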
The z-score is calculated by removing the overall average from the data point and dividing the result by the standard deviation for each axis. The distribution of the standardized data will have a mean of 0 and a standard deviation of 1:
To apply it with sklearn, first, we have to import the relevant StandardScaler class and instantiate an object:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
This time, instead of calling fit() and then transform(), we use the fit_transform() method:
X_scaled = standard_scaler.fit_transform(X)
X_scaled
You should get the following output:
Now we'll look at the minimum and maximum values for each axis:
X_scaled[:,0].min(), X_scaled[:,0].max(), X_scaled[:,1].min(), X_scaled[:,1].max()
You should get the following output:
The value ranges for both axes are much lower now and we can see that their maximum values are around 9 and 18, which indicates that there are some extreme outliers in the data.
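Here too, the transformation can be reproduced by hand. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default, so the two computations match exactly:

```python
# z-score by hand: subtract the column mean, divide by the column standard
# deviation; compare against StandardScaler on synthetic data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 1000.0]])

manual = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std defaults to ddof=0
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))              # the two match
print(np.allclose(scaled.mean(axis=0), 0.0))    # each column: mean ~ 0
print(np.allclose(scaled.std(axis=0), 1.0))     # each column: std ~ 1
```

Unlike min-max scaling, the z-score does not bound values to [0, 1], which is why extreme observations still show up as large values (such as the 9 and 18 seen above).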
Now, fit a k-means model and plot a scatter plot on the z-score-standardized data with the following code snippet:
kmeans = KMeans(random_state=42, n_clusters=3, init='k-means++', n_init=5)
kmeans.fit(X_scaled)
df['cluster7'] = kmeans.predict(X_scaled)
alt.Chart(df).mark_circle().encode(
    x='Average net tax', y='Average total deductions', color='cluster7:N',
    tooltip=['Postcode', 'cluster7', 'Average net tax', 'Average total deductions']
).interactive()
You should get the following output:
The k-means results are very different on the standardized data. Now we can see that there are two main clusters (blue and orange) and their boundaries are no longer straight vertical lines but diagonals, so k-means is now actually taking both axes into consideration. The red cluster contains far fewer data points than in previous iterations, and it seems to group all the extreme high-value outliers together. If your project were about detecting anomalies, you would have found an easy way here to separate outliers from "normal" observations.
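One quick way to verify this kind of anomaly-grouping claim is to count the cluster sizes. A hedged sketch on synthetic data (a Gaussian cloud plus a handful of planted outliers standing in for the ATO set):

```python
# Synthetic stand-in for the ATO data: 200 "normal" points plus 5 planted
# outliers. After standardizing and clustering, value_counts() on the labels
# shows whether one cluster is much smaller than the others (a hint that it
# captured the outliers).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.normal(12, 1, size=(5, 2))   # a handful of extreme points
X = np.vstack([normal, outliers])

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, init='k-means++', n_init=5,
                random_state=42).fit_predict(X_scaled)

sizes = pd.Series(labels).value_counts()
print(sizes.min())   # the smallest cluster holds roughly just the outliers
```

In the ATO case, the same value_counts() call on df['cluster7'] would confirm how small the red cluster really is.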
In this final exercise, we will standardize the data using min-max scaling and the z-score and fit a k-means model for each method and see their impact on k-means:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
X = df[['Average total business income', 'Average total business expenses']]
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(X)
You should get the following output:
X_min_max = min_max_scaler.transform(X)
X_min_max
You should get the following output:
kmeans = KMeans(random_state=1, n_clusters=4, init='k-means++', n_init=5)
kmeans.fit(X_min_max)
df['cluster8'] = kmeans.predict(X_min_max)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(
    x='Average total business income', y='Average total business expenses', color='cluster8:N',
    tooltip=['Postcode', 'cluster8', 'Average total business income', 'Average total business expenses']
).interactive()
You should get the following output:
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(X)
kmeans = KMeans(random_state=1, n_clusters=4, init='k-means++', n_init=5)
kmeans.fit(X_scaled)
df['cluster9'] = kmeans.predict(X_scaled)
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(
    x='Average total business income', y='Average total business expenses', color='cluster9:N',
    tooltip=['Postcode', 'cluster9', 'Average total business income', 'Average total business expenses']
).interactive()
You should get the following output:
The k-means clustering results are very similar between min-max scaling and z-score standardization (the outputs of Steps 10 and 13). Compared to the results in Exercise 5.04, Using Different Initialization Parameters to Achieve a Suitable Outcome, we can see that the boundary between clusters 1 and 2 is slightly lower after standardization; indeed, the blue cluster has switched places in the output. If you compare Figure 5.31 of Exercise 5.04 with Figure 5.54 of Exercise 5.05, Finding the Closest Centroids in Our Dataset, you will notice the switch in cluster 0 (blue). The reason these results are so close to each other is that the ranges of values for the two variables (average total business income and average total business expenses) are almost identical: between 0 and 900,000. Therefore, k-means does not put more weight on either of these variables.
You've completed the final exercise of this chapter. You have learned how to preprocess data before fitting a k-means model with two very popular methods: min-max scaling and z-score.
You are working for an international bank. The credit department is reviewing its offerings and wants to get a better understanding of its current customers. You have been tasked with performing customer segmentation analysis. You will perform cluster analysis with k-means to identify groups of similar customers.
The following steps will help you complete this activity:
Note
This is the German Credit Dataset from the UCI Machine Learning Repository.
It is available from the Packt repository at https://packt.live/31L2c2b.
This dataset is in the .dat file format. You can still load the file using read_csv(), but you will need to specify the following parameters: header=None, sep='\s+', and prefix='X'.
Even though all the columns in this dataset are integers, most of them are actually categorical variables. The data in these columns is not continuous. Only two variables are really numeric. Find and use them for your clustering.
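The loading step can be sketched on an inline whitespace-separated sample (the rows below are stand-ins, not the real German Credit data). Note that the prefix parameter was removed from read_csv() in pandas 2.0, so on recent versions you can rename the columns manually instead:

```python
# Demonstrates header=None + sep='\s+' parsing on a small inline sample;
# the rename reproduces what prefix='X' used to do on older pandas versions.
import io
import pandas as pd

sample = "1 6 1169 67\n2 48 5951 22\n"   # made-up rows in the .dat layout
df = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+')
df.columns = [f'X{i}' for i in df.columns]

print(df.columns.tolist())   # ['X0', 'X1', 'X2', 'X3']
print(df.shape)              # (2, 4)
```

From there, selecting the two truly numeric columns and fitting KMeans follows the same pattern as the exercises in this chapter.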
You should get the following output:
Note
The solution to this activity can be found at the following address: https://packt.live/2GbJloz.
You are now ready to perform cluster analysis with the k-means algorithm on your own dataset. This type of analysis is very popular in the industry for segmenting customer profiles as well as detecting suspicious transactions or anomalies.
We learned about a lot of different concepts, such as centroids and squared Euclidean distance. We went through the main k-means hyperparameters: init (the initialization method), n_init (the number of initialization runs), n_clusters (the number of clusters), and random_state (the random seed). We also discussed the importance of choosing the optimal number of clusters, initializing centroids properly, and standardizing data. You have learned how to use the following Python packages: pandas, altair, and sklearn, including its KMeans class.
In this chapter, we only looked at k-means, but it is not the only clustering algorithm. There are quite a few algorithms that take different approaches, such as hierarchical clustering, density-based clustering, and Gaussian mixture models, to name a few. If you are interested in this field, you now have all the basic knowledge you need to explore these other algorithms on your own.
Next, you will see how we can assess the performance of these models and what tools can be used to make them even better.