In the previous two chapters, we discussed two important algorithms used in predictive analytics: linear regression and logistic regression. Both are widely used, and both are supervised algorithms. As you may recall from the earlier chapters, a supervised algorithm is one where the historical values of an output variable are known from the data. It uses these values to train a model that predicts the output variable for new data. An unsupervised algorithm, on the other hand, has no output variable to learn from, which can be seen as either a limitation or a freedom. It builds the model from the values of the predictor variables alone.
Clustering, the algorithm that we are going to discuss in this chapter, is an unsupervised algorithm. Clustering or segmentation, as the name suggests, groups entries into clusters or segments such that the entries within a cluster are more similar to each other than to the entries outside it. Once the clusters are defined, one can identify the properties of each cluster and define plans or strategies separately for each one. This allows planning to be tailored to each cluster.
The broad focus of this chapter is clustering and segmentation. By the end of this chapter, you will have learned the following:
Now let us discuss the various aspects of clustering in greater detail.
Clustering basically means the following:
Clustering algorithms work by calculating the similarity or dissimilarity between observations and using it to group them into clusters.
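To make the idea of dissimilarity concrete, here is a minimal sketch that measures how far apart three people are when each is described by a monthly income and a monthly expense. It assumes Euclidean distance as the dissimilarity measure, which is one common choice and not necessarily the only one used later in the chapter. The three people and their values are hypothetical.

```python
import math

# Hypothetical (monthly income, monthly expense) pairs; illustrative values only.
person_a = (2000, 1500)
person_b = (2200, 1600)
person_c = (9000, 4000)

# Euclidean distance (an assumed choice of metric):
# a smaller value means the two observations are more similar.
dist_ab = math.dist(person_a, person_b)
dist_ac = math.dist(person_a, person_c)

print(f"distance a-b: {dist_ab:.1f}")
print(f"distance a-c: {dist_ac:.1f}")
```

Because a is much closer to b than to c, a clustering algorithm based on this distance would tend to place a and b in the same cluster.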
Let us look at the plot of Monthly Income and Monthly Expense for a group of 400 people. As one can see, there are visible clusters of people whose earnings and expenses differ from those of people in other clusters but are very similar to those of people in their own cluster:
In the preceding plot, the visible clusters of the people can be identified based on their income and expense levels, as follows:
This analysis can be very helpful if, say, an organization is trying to target potential customers for its different product ranges. Once the clusters are known, the organization can target different clusters with different product ranges. For example, it could target cluster 4 with its premium products and clusters 1 and 2 with its low-end products. This can result in higher conversion rates for its advertising campaigns.
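Once each person carries a cluster label, describing a cluster comes down to summarizing its members. The following sketch groups a handful of records by label and computes the average income and expense per cluster. The records, the labels, and the `profiles` structure are all hypothetical and do not reproduce the 400-person dataset plotted above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cluster label, monthly income, monthly expense).
# Labels and values are invented for illustration.
records = [
    (1, 1800, 1500),
    (1, 2100, 1700),
    (2, 2500, 900),
    (4, 9000, 7000),
    (4, 9500, 7400),
]

# Group the observations by their cluster label.
by_cluster = defaultdict(list)
for label, income, expense in records:
    by_cluster[label].append((income, expense))

# Summarize each cluster by its size and its average income and expense.
profiles = {
    label: {
        "size": len(rows),
        "mean_income": mean(row[0] for row in rows),
        "mean_expense": mean(row[1] for row in rows),
    }
    for label, rows in by_cluster.items()
}

for label in sorted(profiles):
    print(label, profiles[label])
```

A profile like this is what lets a marketing team decide which product range suits which cluster.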
This was one illustration of how clustering can be advantageous. It was a very simple case with just two attributes of the potential customers, so we were able to plot it on a 2D graph and see the clusters. However, this is not possible most of the time, because real data usually has more than two attributes. We therefore need to define a generalized metric for the similarity or dissimilarity of observations. We will discuss this in detail later in this chapter.
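As a rough preview of what a generalized metric can look like, the sketch below computes pairwise distances between observations that have four attributes each, which could not be plotted on a 2D graph. It assumes Euclidean distance. The helper names `euclidean` and `distance_matrix` and the sample values are hypothetical, and the metric treated later in the chapter may differ.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length attribute vectors."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(observations):
    """Symmetric matrix of pairwise distances; the diagonal is zero."""
    n = len(observations)
    return [
        [euclidean(observations[i], observations[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical observations with four attributes each (values are made up).
obs = [
    (1.0, 0.5, 3.0, 2.0),
    (1.2, 0.4, 2.9, 2.1),
    (6.0, 4.5, 0.5, 7.0),
]

matrix = distance_matrix(obs)
for row in matrix:
    print([round(d, 2) for d in row])
```

The same function works for any number of attributes, which is the point of a generalized metric: similarity no longer depends on being able to see the clusters in a plot.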
Some of the properties of a good cluster can be listed as follows:
Clustering can have a variety of applications. The following are some of the cases where clustering is used: