Hierarchical clustering

Many operations, such as making recommendations and finding relationships, use clustering as an integral component. Clustering groups a dataset into several groups using one or more measurements, such that items in the same cluster are similar to each other and items in different clusters are more dissimilar. For example, given the home addresses of a set of patients, clustering can group the patients so that the members of each group live close to each other, allowing doctors to visit them in the most efficient manner.

There are many clustering algorithms, each with different costs and accuracy. Hierarchical clustering is one of the most basic and most accurate of these algorithms. It works as follows:

  1. First, hierarchical clustering assigns each data point to its own cluster.
  2. It calculates the distance between all cluster pairs and merges the two clusters that are closest to each other.
  3. It repeats the process until only one cluster is left. At each repetition, the algorithm records the pair of clusters it has merged. This merge history forms a tree that combines the clusters into larger and larger groups towards the root. Users can cut the tree at the depth that yields the number of clusters they need, as sketched in the listing that follows this list.
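The following is a minimal, self-contained sketch of this agglomerative process (not the recipe's actual code, which appears later in this recipe). It clusters one-dimensional points and stops once a desired number of clusters k remains, which plays the role of cutting the tree; the class name AgglomerativeSketch and the use of cluster means as the inter-cluster distance are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

public class AgglomerativeSketch
{
  // Merge clusters of 1-D points until only k clusters remain (assumes 1 <= k <= points.size()).
  public static List<List<Double>> cluster(List<Double> points, int k)
  {
    List<List<Double>> clusters = new ArrayList<>();
    for (double p : points)
    {
      // Step 1: each point starts in its own cluster
      List<Double> c = new ArrayList<>();
      c.add(p);
      clusters.add(c);
    }
    while (clusters.size() > k)
    {
      // Steps 2-3: find and merge the two closest clusters
      int bestI = 0, bestJ = 1;
      double best = Double.MAX_VALUE;
      for (int i = 0; i < clusters.size(); i++)
      {
        for (int j = i + 1; j < clusters.size(); j++)
        {
          double d = Math.abs(mean(clusters.get(i)) - mean(clusters.get(j)));
          if (d < best) { best = d; bestI = i; bestJ = j; }
        }
      }
      clusters.get(bestI).addAll(clusters.remove(bestJ));
    }
    return clusters;
  }

  private static double mean(List<Double> c)
  {
    double sum = 0;
    for (double v : c) sum += v;
    return sum / c.size();
  }
}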

However, hierarchical clustering has a complexity of O(n² log(n)). In other words, the algorithm takes about Cn² log(n) time for an input of size n and some constant C. Hence, it is often too expensive to use with a large dataset. Nevertheless, it frequently serves as the basis for many other clustering algorithms.

In this recipe, we will use hierarchical clustering to cluster a sample drawn from the Amazon dataset. It is worth noting that hierarchical clustering itself is not a MapReduce algorithm, but we will use its results in the MapReduce algorithm in the next recipe.

Getting ready

This recipe assumes that you have followed the first recipe and extracted the Amazon product data.

How to do it...

The following steps describe how to run the hierarchical clustering sample:

  1. Copy the results of the first MapReduce job of the last recipe to the local machine.
    $ bin/hadoop dfs -get /data/output1/ product-sales.data
    
  2. Run the following command to do hierarchical clustering:
    $ java -cp hadoop-cookbook-chapter8.jar chapter8.HirachicalClusterProcessor product-sales.data
    
  3. The algorithm will generate a file called clusters.data that contains the resulting clusters.
  4. You can find information about the clusters in the clusters.data file created in the local directory, which contains the centroid of each cluster. Furthermore, the clusters-raw.data file includes all the points assigned to each cluster.

How it works...

You can find the source for the recipe in src/chapter8/HirachicalClusterProcessor.java.

When executed, the code reads through the input data file, loads the data of 1,000 Amazon customers randomly selected from the file, and performs hierarchical clustering on those customers.
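The sampling logic itself lives inside HirachicalClusterProcessor. As a rough illustration of one standard way to pick 1,000 records uniformly at random in a single pass over the file, the following hedged sketch uses reservoir sampling; the CustomerSampler class and its file handling are assumptions for illustration, not the recipe's code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CustomerSampler
{
  // Keep a uniform random sample of up to 'sampleSize' lines from the input file.
  public static List<String> sample(String file, int sampleSize) throws IOException
  {
    List<String> reservoir = new ArrayList<>(sampleSize);
    Random random = new Random();
    int seen = 0;
    try (BufferedReader reader = new BufferedReader(new FileReader(file)))
    {
      String line;
      while ((line = reader.readLine()) != null)
      {
        seen++;
        if (reservoir.size() < sampleSize)
        {
          // Fill the reservoir with the first 'sampleSize' lines
          reservoir.add(line);
        }
        else
        {
          // Replace an existing entry with decreasing probability sampleSize/seen
          int slot = random.nextInt(seen);
          if (slot < sampleSize)
          {
            reservoir.set(slot, line);
          }
        }
      }
    }
    return reservoir;
  }
}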

Hierarchical clustering starts by placing each customer in its own cluster. It then finds the two clusters that are closest to each other and merges them into one cluster. After the merge, it recalculates the distances between the new cluster and the remaining clusters, and repeats the process until it is left with a single cluster.

The getDistance() method of the AmazonCustomer.java class under src/chapter8 shows how the distance between two Amazon customers is calculated. It uses a variation of the Jaccard distance. With this distance, if two customers have bought the same items and have given similar reviews to those items, the distance between them will be small. On the other hand, if they have given different reviews, the distance between them will be large.

public double getDistance(ClusterablePoint other)
{
  // Start from a neutral distance and adjust it for every item both customers have rated
  double distance = 5;
  AmazonCustomer customer1 = (AmazonCustomer) this;
  AmazonCustomer customer2 = (AmazonCustomer) other;
  for (ItemData item1 : customer1.itemsBrought)
  {
    for (ItemData item2 : customer2.itemsBrought)
    {
      if (item1.equals(item2))
      {
        double ratingDiff = Math.abs(item1.rating - item2.rating);
        if (ratingDiff < 5)
        {
          // Similar ratings for the same item pull the two customers closer together
          distance = distance - (5 - ratingDiff) / 5;
        }
        else
        {
          distance = distance + (5 - ratingDiff) / 5;
        }
      }
    }
  }
  // Clamp the result to the range 0.5 to 10
  return Math.min(10, Math.max(0.5, distance));
}

A naive implementation of the algorithm recalculates all the distances between clusters every time two clusters are merged, and the resulting algorithm has O(n³) computational complexity. In other words, the algorithm will take Cn³ amount of time to run with an input of size n, for some constant C. However, by remembering the distances between clusters and only calculating the distances involving the new cluster, the resulting implementation has O(n² log(n)) complexity. The following code listing shows the implementation of the algorithm.

public List<Cluster> doHirachicalClustering(List<ClusterablePoint> points)
{
  List<Cluster> clusters =
    new ArrayList<HirachicalClusterProcessor.Cluster>();
  // Start with each point in its own cluster
  for (ClusterablePoint point : points)
  {
    clusters.add(new Cluster(point));
  }
  // Calculate the initial distances between all cluster pairs
  for (int i = 0; i < clusters.size(); i++)
  {
    for (int j = (i + 1); j < clusters.size(); j++)
    {
      ClusterPair clusterPair = new ClusterPair(clusters.get(i),
        clusters.get(j));
      addNewPairs(clusterPair);
    }
  }
  while (clusters.size() > 1)
  {
    // Find the pair of clusters with the smallest distance
    ClusterPair nextPair = null;
    double minDist = Double.MAX_VALUE;
    for (ClusterPair pair : pairsSortedByDistance)
    {
      if (pair.distance < minDist)
      {
        nextPair = pair;
        minDist = pair.distance;
      }
    }
    ClusterPair pair = nextPair;
    removePairs(pair);
    // Merge the closest pair into a new cluster
    Cluster newCluster = pair.merge();
    clusters.remove(pair.cluster1);
    clusters.remove(pair.cluster2);
    // Recalculate pairs for the new cluster
    for (Cluster cluster : clusters)
    {
      ClusterPair newclusterPair = new ClusterPair(cluster, newCluster);
      addNewPairs(newclusterPair);
    }
    clusters.add(newCluster);
  }
  return clusters;
}

Finally, the program will output the centroid of each cluster. Here, the centroid is the point within the cluster that minimizes the sum of distances to all other points in the cluster.
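As a rough illustration of this definition, the following hypothetical helper picks the point whose total distance to the rest of the cluster is smallest. It reuses the ClusterablePoint type and getDistance() method from the listings above; findCentroid() itself is not part of the recipe's code.

// Hypothetical helper: return the point whose summed distance to all other
// points in the cluster is smallest, matching the centroid definition above.
public static ClusterablePoint findCentroid(List<ClusterablePoint> clusterPoints)
{
  ClusterablePoint centroid = null;
  double bestSum = Double.MAX_VALUE;
  for (ClusterablePoint candidate : clusterPoints)
  {
    double sum = 0;
    for (ClusterablePoint other : clusterPoints)
    {
      if (candidate != other)
      {
        sum += candidate.getDistance(other);
      }
    }
    if (sum < bestSum)
    {
      bestSum = sum;
      centroid = candidate;
    }
  }
  return centroid;
}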

There's more...

Other alternative distance measures include Cosine distance, Edit distance, and Hamming distance. Chapter 7, Clustering, of the book Mining of Massive Datasets, by Anand Rajaraman and Jeffrey D. Ullman, explains these distances in detail. You can also learn more about hierarchical clustering from the same book. The book can be found at http://infolab.stanford.edu/~ullman/mmds.html.
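For example, one common formulation of the Cosine distance between two rating vectors is 1 minus their cosine similarity, as in the following generic sketch; it is not tied to the AmazonCustomer class or the recipe's code.

// Cosine distance between two equal-length vectors: 1 - (a.b) / (|a||b|).
public static double cosineDistance(double[] a, double[] b)
{
  double dot = 0, normA = 0, normB = 0;
  for (int i = 0; i < a.length; i++)
  {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0)
  {
    // Treat an all-zero vector as maximally distant
    return 1.0;
  }
  return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}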
