Chapter 7. Clustering Customer Profiles

One of the remarkable capabilities of neural networks trained with unsupervised learning is their ability to find hidden patterns that even experts may have no clue about. In this chapter, we explore this fascinating feature through a practical application: finding customer clusters in a transactions database. We will review unsupervised learning and the clustering task, and then the reader will be provided with a practical example of customer profiling and its implementation in Java. In this chapter, we will cover the following topics:

  • Clustering Task
    • Cluster Analysis
    • Cluster Evaluation
  • Applied Unsupervised Learning
    • Neural Network of Radial Basis Functions
    • Kohonen Network for Clustering
    • Handling Different Types of Data
  • Customer Profiling
    • Preprocessing
  • Implementation in Java
    • Credit Analysis and Profiles of Customers

Clustering task

Clustering is part of a broader set of data analysis tasks whose objective is to group elements that resemble each other into clusters or groups. A clustering task is fully based on unsupervised learning, since no target output data is needed to find clusters; instead, the solution designer chooses the number of clusters into which the records should be grouped and checks how the algorithm responds.

Tip

A clustering task may seem to overlap with a classification task; the crucial difference is that in clustering there is no need to have a predefined set of classes before the clustering algorithm is run.

One may wish to apply clustering when there is little or no information about how the data can be gathered into groups. Given a dataset, we want our neural network to identify both the groups and their members. While this may seem easy and straightforward to perform visually on a two-dimensional dataset, as shown in the following figure, with a higher number of dimensions this task becomes far from trivial and requires an algorithmic solution. An example of two-dimensional clustering is shown as follows:

[Figure: an example of clustering on a two-dimensional dataset]

In clustering, the number of clusters is not determined by the data, but by the data analyst who is looking to cluster the data. Here, the boundaries are a little different from those of classification tasks because they depend primarily on the number of clusters.

Cluster analysis

One difficulty in clustering tasks, and in unsupervised learning tasks in general, is the accurate interpretation of the results. In supervised learning there is a defined target from which we can derive an error measure or a confusion matrix; in unsupervised learning, the evaluation of quality is completely different and depends entirely on the data itself. The validation criteria involve indexes that assess how well the data is distributed across the clusters, as well as external opinions from experts on the data, which are also a measure of quality.

Tip

For example, let's suppose a task of clustering plants given their characteristics (size, leaf color, fruiting period, and so on). If a neural network mistakenly groups cacti and pine trees in the same cluster, a botanist would certainly not endorse the grouping and, on the basis of his/her specific knowledge of the field, would state that it does not make any sense.

Two major issues arise in clustering. One is the case in which one of the neural network's outputs is never activated, meaning that one cluster has no data points associated with it. The other is the case of nonlinear or sparse clusters, which may be erroneously split into several clusters when there is actually only one, as shown in the following figure:

[Figure: a single nonlinear/sparse cluster erroneously split into several clusters]

Cluster evaluation and validation

Unfortunately, if the neural network clusters badly, one needs to either redefine the number of clusters or perform additional data preprocessing. To evaluate how good the clustered data is, the Davies–Bouldin and Dunn indexes may be applied.

The Davies–Bouldin index takes the clusters' centroids into account in order to find the inter- and intra-cluster distances between clusters and cluster members:

DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}

where n is the number of clusters, ci is the centroid of cluster i, σi is the average distance of all elements in cluster i to its centroid, and d(ci, cj) is the distance between the centroids of clusters i and j. The smaller the value of the DB index, the better the clustering produced by the neural network is considered to be.
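As a minimal sketch, the DB index can be computed in Java from the cluster assignments and centroids as follows; the class and method names (DaviesBouldin, daviesBouldin, distance) are illustrative and are not part of the chapter's code:

/** Illustrative sketch of the Davies–Bouldin index (names are assumptions). */
public class DaviesBouldin {

    /** Euclidean distance between two points. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * @param data       data points, one row per record
     * @param assignment cluster index of each record
     * @param centroids  centroid of each cluster
     * @return the Davies–Bouldin index (lower is better)
     */
    static double daviesBouldin(double[][] data, int[] assignment, double[][] centroids) {
        int n = centroids.length;
        // sigma[i] = average distance of the members of cluster i to its centroid
        double[] sigma = new double[n];
        int[] count = new int[n];
        for (int r = 0; r < data.length; r++) {
            int c = assignment[r];
            sigma[c] += distance(data[r], centroids[c]);
            count[c]++;
        }
        for (int i = 0; i < n; i++) {
            sigma[i] = count[i] > 0 ? sigma[i] / count[i] : 0.0;
        }
        // DB = (1/n) * sum over i of max over j != i of (sigma_i + sigma_j) / d(c_i, c_j)
        double db = 0.0;
        for (int i = 0; i < n; i++) {
            double worst = 0.0;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                worst = Math.max(worst,
                        (sigma[i] + sigma[j]) / distance(centroids[i], centroids[j]));
            }
            db += worst;
        }
        return db / n;
    }
}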

However, for dense and sparse clusters, the DB index will not give much useful information. This limitation can be overcome with the Dunn index:

D = \frac{\min_{1 \le i < j \le n} d(i, j)}{\max_{1 \le k \le n} d'(k)}

where d(i, j) is the inter-cluster distance between clusters i and j, and d'(k) is the intra-cluster distance of cluster k. The higher the Dunn index, the better the clustering, because even though the clusters may be sparse, their members still need to be grouped together, and high intra-cluster distances denote a bad grouping of data.
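The sketch below computes the Dunn index under the common pairwise definitions, assuming d(i, j) is the minimum distance between members of clusters i and j and d'(k) is the diameter of cluster k (other variants exist); the names DunnIndex and dunn are illustrative:

/** Illustrative sketch of the Dunn index (names and distance variants are assumptions). */
public class DunnIndex {

    /** Euclidean distance between two points. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * @param data       data points, one row per record
     * @param assignment cluster index of each record
     * @return the Dunn index (higher is better)
     */
    static double dunn(double[][] data, int[] assignment) {
        // smallest distance between points of different clusters (inter-cluster distance)
        double minInter = Double.POSITIVE_INFINITY;
        // largest distance between points of the same cluster (intra-cluster distance, or diameter)
        double maxIntra = 0.0;
        for (int a = 0; a < data.length; a++) {
            for (int b = a + 1; b < data.length; b++) {
                double d = distance(data[a], data[b]);
                if (assignment[a] == assignment[b]) {
                    maxIntra = Math.max(maxIntra, d);
                } else {
                    minInter = Math.min(minInter, d);
                }
            }
        }
        return minInter / maxIntra;
    }
}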

External validation

In some cases, there is already an expected result for clustering, as in the plant clustering example. This is called external validation. One may apply a neural network with unsupervised learning to cluster data that has already been assigned labels. The major difference from classification lies in the fact that the target outputs are not considered during training, so the algorithm itself is expected to draw a borderline based only on the data.
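As an illustration of external validation, the following sketch computes purity, one possible agreement measure (chosen here as an assumption, not named in the text): each cluster is credited with the size of the majority class among its members, and the known labels are used only for evaluation, never for training. The names ExternalValidation and purity are hypothetical.

import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of a simple external validation measure (purity). */
public class ExternalValidation {

    /**
     * @param assignment cluster index produced by the unsupervised network
     * @param labels     known class of each record (used only for evaluation)
     * @return fraction of records falling into the majority class of their cluster
     */
    static double purity(int[] assignment, int[] labels) {
        // count how many records of each class fall into each cluster
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int r = 0; r < assignment.length; r++) {
            counts.computeIfAbsent(assignment[r], k -> new HashMap<>())
                  .merge(labels[r], 1, Integer::sum);
        }
        // sum the size of the majority class in each cluster
        int agree = 0;
        for (Map<Integer, Integer> byClass : counts.values()) {
            int best = 0;
            for (int c : byClass.values()) {
                best = Math.max(best, c);
            }
            agree += best;
        }
        return (double) agree / assignment.length;
    }
}

A purity close to 1 indicates that the clusters found by the network largely agree with the known grouping; a low purity suggests the network has drawn its borderlines differently from the expert labeling.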
