Chapter 7. Clustering Customer Profiles

One of the amazing capabilities of neural networks applying unsupervised learning is their ability to find hidden patterns that even experts may have no clue about. In this chapter, we're going to explore this fascinating feature through a practical application that finds customer and product clusters in a transactions database. We'll go through a review of unsupervised learning and the clustering task. To demonstrate this application, the reader will be provided with a practical example of customer profiling and its implementation in Java. The topics of this chapter are:

  • Clustering tasks
  • Cluster analysis
  • Cluster evaluation
  • Applied unsupervised learning
  • Radial basis functions neural network
  • Kohonen network for clustering
  • Handling different types of data
  • Customer profiling
  • Preprocessing
  • Implementation in Java
  • Credit analysis and profiles of customers

Clustering tasks

Clustering is part of a broader set of tasks in data analysis, whose objective is to group elements that are more similar to each other into clusters or groups. Clustering tasks are fully based on unsupervised learning since there is no need to include any target output data in order to find clusters; instead, the solution designer may choose the number of clusters that they want to group the records into and check the response of the algorithm to it.

Tip

Clustering tasks may seem to overlap with classification tasks with the crucial difference that in clustering there is no need to have a predefined set of classes before the clustering algorithm is run.

One may wish to apply clustering when there is little or no information at all about how the data can be gathered into groups. Provided with a dataset, we wish our neural network to identify both the groups and their members. While this may seem easy and straightforward to perform visually in a two-dimensional dataset, as shown in the following figure, with a higher number of dimensions this task is no longer trivial and needs an algorithmic solution:

[Figure: Clustering tasks - groups of points in a two-dimensional dataset]

In clustering, the number of clusters is not determined by the data, but by the data analyst who is looking to cluster the data. Here the boundaries are a little bit different than those of classification tasks because they depend primarily on the number of clusters.
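Since the boundaries depend on the number of clusters chosen by the analyst, the core operation is simply assigning each record to its nearest cluster center. The following is a minimal sketch of that idea, independent of the book's library; the centers, the point, and the use of Euclidean distance are illustrative assumptions:

```java
// Minimal sketch (not the book's code): assigning a point to the nearest
// of a chosen number of cluster centers, using Euclidean distance.
public class NearestCluster {

    // Euclidean distance between two points of equal dimension
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to the given point; the cluster
    // boundaries therefore depend entirely on how many centers were chosen
    static int assign(double[] point, double[][] centers) {
        int best = 0;
        double bestDistance = distance(point, centers[0]);
        for (int i = 1; i < centers.length; i++) {
            double d = distance(point, centers[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

With three centers, the same point may fall into a different group than it would with two, which is why the chosen number of clusters shapes the boundaries.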

Cluster analysis

One difficulty in clustering tasks, and in unsupervised learning tasks in general, is the accurate interpretation of the results. While in supervised learning there is a defined target, from which we can derive an error measure or confusion matrix, in unsupervised learning the evaluation of quality is totally different, and also totally dependent on the data itself. The validation criteria involve indexes that assess how well the data is distributed across the clusters, as well as external opinions from experts on the data, which are also a measure of quality.

Tip

To illustrate, let's suppose a task of clustering plants given their characteristics (sizes, leaf colors, period of fruiting, and so on), and a neural network mistakenly groups cacti with pine trees in the same cluster. A botanist would certainly not endorse such a grouping, knowing from their specific field expertise that it does not make any sense.

Two major issues happen in clustering. One is the case in which one of the neural network's outputs is never activated, meaning that one cluster does not have any data point associated with it. The other is the case of nonlinear or sparse clusters, which could be erroneously split into several clusters when actually there might be only one.

[Figure: Cluster analysis - examples of problematic cluster shapes]
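The first issue, a never-activated output, can be detected by simply counting how many records each cluster receives. The following is an illustrative sketch, not taken from the chapter's library, where the assignments array is assumed to hold the index of the winning cluster for each record:

```java
// Illustrative sketch: detecting "dead" clusters, that is, output units
// that no data point was ever assigned to.
public class DeadClusterCheck {

    // Returns true at every cluster index that received no members
    static boolean[] findDeadClusters(int[] assignments, int numberOfClusters) {
        // count how many records each cluster received
        int[] counts = new int[numberOfClusters];
        for (int a : assignments) {
            counts[a]++;
        }
        // a cluster with a zero count was never activated
        boolean[] dead = new boolean[numberOfClusters];
        for (int k = 0; k < numberOfClusters; k++) {
            dead[k] = (counts[k] == 0);
        }
        return dead;
    }
}
```

A dead cluster is usually a sign that the chosen number of clusters is too high or that the data needs further preprocessing.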

Cluster evaluation and validation

Unfortunately, if the neural network clusters badly, one needs either to redefine the number of clusters or perform additional data preprocessing. To evaluate the quality of the resulting clusters, the Davies-Bouldin and Dunn indexes may be applied.

The Davies-Bouldin index takes into account the clusters' centroids in order to find inter-cluster and intra-cluster distances between clusters and cluster members:

DB = (1/n) · Σ_{i=1..n} max_{j≠i} [ (s_i + s_j) / d(c_i, c_j) ]

Where n is the number of clusters, ci is the centroid of cluster i, si is the average distance of all elements in cluster i to its centroid, and d(ci,cj) is the distance between the centroids of clusters i and j. The smaller the value of the DB index, the better the clustering is considered to be.
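To make the formula concrete, here is a toy check of the Davies-Bouldin computation, independent of the book's classes. The two centroids and the average intra-cluster distances are invented purely for illustration:

```java
// Toy Davies-Bouldin computation: two clusters with centroids (0,0) and
// (4,0), each with an average intra-cluster distance (sigma) of 1.
public class DaviesBouldinToy {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // DB = (1/n) * sum over i of max over j != i of
    //      (sigma_i + sigma_j) / d(c_i, c_j)
    static double dbIndex(double[][] centroids, double[] sigma) {
        int n = centroids.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double worst = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    double ratio = (sigma[i] + sigma[j])
                            / euclidean(centroids[i], centroids[j]);
                    worst = Math.max(worst, ratio);
                }
            }
            sum += worst;
        }
        return sum / n;
    }
}
```

For these numbers the ratio is (1 + 1) / 4 = 0.5 for each cluster, so the index is 0.5; tighter clusters placed further apart would drive it lower.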

However, for dense and sparse clusters, the DB index will not give much useful information. This limitation can be overcome with the Dunn index:

D = min_{1≤i<j≤n} d(i,j) / max_{1≤k≤n} d'(k)

Where d(i,j) is the inter-cluster distance between clusters i and j, and d'(k) is the intra-cluster distance of cluster k. Here, the higher the Dunn index, the better the clustering: although the clusters may be sparse, they still need to be grouped together, and high intra-cluster distances denote a bad grouping of data.

Implementation

In the CompetitiveLearning class, we are going to implement these indexes:

public double DBIndex(){
  int numberOfClusters = this.neuralNet.getNumberOfOutputs();
  double sum = 0.0;
  for(int i = 0; i < numberOfClusters; i++){
    double[] index = new double[numberOfClusters];
    for(int j = 0; j < numberOfClusters; j++){
      if(i != j){
        //average distances of the members of clusters i and j
        double sigmai = averageDistance(i, trainingDataSet);
        double sigmaj = averageDistance(j, trainingDataSet);
        //the neuron weights play the role of the cluster centroids
        Double[] centeri = neuralNet.getOutputLayer().getNeuron(i).getWeights();
        Double[] centerj = neuralNet.getOutputLayer().getNeuron(j).getWeights();
        double distance = getDistance(centeri, centerj);
        index[j] = (sigmai + sigmaj) / distance;
      }
    }
    //take the worst (largest) ratio found for cluster i
    sum += ArrayOperations.max(index);
  }
  return sum / numberOfClusters;
}

public double Dunn(){
  int numberOfClusters = this.neuralNet.getNumberOfOutputs();
  //smallest distance between members of each pair of clusters
  ArrayList<Double> interClusterDistance = new ArrayList<>();
  for(int i = 0; i < numberOfClusters; i++){
    for(int j = i + 1; j < numberOfClusters; j++){
      interClusterDistance.add(minInterClusterDistance(i, j, trainingDataSet));
    }
  }
  //largest distance between members of the same cluster
  ArrayList<Double> intraClusterDistance = new ArrayList<>();
  for(int k = 0; k < numberOfClusters; k++){
    intraClusterDistance.add(maxIntraClusterDistance(k, trainingDataSet));
  }
  return ArrayOperations.min(interClusterDistance) / ArrayOperations.max(intraClusterDistance);
}

External validation

In some cases, there is already an expected result for the clustering, as in the plant clustering example. This is called external validation. One may apply a neural network with unsupervised learning to cluster data that already has labels assigned. The major difference from classification lies in the fact that the target outputs are not used during training, so the algorithm is expected to draw the boundaries based only on the data.
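One common external-validation measure is purity: the fraction of records whose cluster's majority label matches their own. This is a generic sketch of that measure, not taken from the chapter's code; the arrays of cluster indexes and true labels are assumed inputs:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of purity, an external-validation measure that
// compares cluster assignments against already-known labels.
public class PurityCheck {

    // clusters[i] and labels[i] hold the cluster index and the true
    // label of record i; returns a value between 0 and 1
    static double purity(int[] clusters, int[] labels) {
        // count label occurrences inside each cluster
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusters.length; i++) {
            counts.computeIfAbsent(clusters[i], k -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        // each cluster contributes the size of its majority label
        int correct = 0;
        for (Map<Integer, Integer> labelCounts : counts.values()) {
            int max = 0;
            for (int c : labelCounts.values()) {
                max = Math.max(max, c);
            }
            correct += max;
        }
        return (double) correct / clusters.length;
    }
}
```

A purity of 1.0 means every cluster is label-homogeneous; note, however, that purity alone rewards using many small clusters, so it is usually read alongside internal indexes such as Davies-Bouldin or Dunn.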
