As explained in the earlier recipe, although very accurate, hierarchical clustering has a time complexity of O(n^2 log n), and therefore is not applicable to large datasets. For example, the Amazon dataset we will use has about 0.2 million entries, and clustering it hierarchically would need on the order of 10^13 (10 tera) calculations.
This recipe describes how to apply the GRGPF clustering algorithm to the Amazon sales dataset. We can find the distance between any two data points (products) using the items they have co-purchased, but we do not know how to place those data points in a multi-dimensional space (such as a 2D or 3D space) with orthogonal axes. Therefore, we say that the data is in a non-Euclidean space.
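One way to picture such a non-Euclidean distance is the Jaccard distance between two products' co-purchase sets. The following is a minimal, illustrative sketch; the class and method names are hypothetical and are not part of the recipe's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: defining a distance between two products when all we
// know is which items were co-purchased with each of them.
public class CoPurchaseDistance {

    // Jaccard distance: 1 - |A ∩ B| / |A ∪ B|. This is a proper metric on
    // sets, yet the points have no coordinates in a space with orthogonal
    // axes, which is what makes the data non-Euclidean.
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> p1 = new HashSet<>(Arrays.asList("B001", "B002", "B003"));
        Set<String> p2 = new HashSet<>(Arrays.asList("B002", "B003", "B004"));
        // |intersection| = 2 and |union| = 4, so the distance is 1 - 2/4 = 0.5
        System.out.println(distance(p1, p2));
    }
}
```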
The GRGPF clustering algorithm is a highly scalable algorithm applicable to non-Euclidean spaces. It works by first taking a small random sample of the data and clustering it using an algorithm such as hierarchical clustering, and then using those clusters to cluster the large dataset without keeping all the data in memory.
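The sample-then-assign idea can be sketched as follows. This is not the full GRGPF algorithm (which also maintains a cluster tree and per-cluster statistics); for brevity the sampled points stand in for the centroids that hierarchical clustering of the sample would produce, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the sample-then-assign idea behind GRGPF: cluster a
// small sample in memory, then stream every remaining point past the
// resulting centroids, assigning each to the nearest one.
public class SampleThenAssign {

    // Return the index of the nearest centroid. A 1-D point with |a - b|
    // as the distance stands in for a customer under a non-Euclidean
    // distance function.
    static int nearest(double point, List<Double> centroids) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = Math.abs(point - centroids.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Pretend 0.5 and 9.9 were the sampled points, already clustered.
        List<Double> centroids = Arrays.asList(0.5, 9.9);
        // The rest of the dataset is assigned in one streaming pass.
        for (double p : Arrays.asList(0.2, 0.9, 9.7, 10.3)) {
            System.out.println(p + " -> cluster " + nearest(p, centroids));
        }
    }
}
```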
The following steps describe how to get ready to cluster the Amazon dataset.

1. Set the HADOOP_HOME environment variable to refer to the Hadoop installation directory.

The following steps describe how to cluster the Amazon dataset.

1. Unzip the source code of this chapter (chapter8.zip). We will call that folder CHAPTER_8_SRC.
2. Change the hadoop.home property in the build.xml file under CHAPTER_8_SRC to point to your Hadoop installation directory.
3. Run the ant build command from the CHAPTER_8_SRC directory.
4. Copy build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
5. Run the following command from HADOOP_HOME. It will use the output generated by the first MapReduce task of the first recipe.
$ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.GRGPFClustering /data/output1 /data/output3
6. Read the results by running the following command:
$ bin/hadoop dfs -cat /data/output3/*
You will see that it prints the results as follows. Here the key indicates the cluster ID, and the rest of the results show the details of the customers assigned to that cluster.
customerID=A3S764AIU7PCLT,clusterID=0, review=ASIN=0140296468#title=The New New Thing: A Silicon Valley Story ..
As the figure depicts, this recipe consists of a single MapReduce task. You can find the source for the recipe in src/chapter8/GRGPFClustering.java.
When initialized, the MapReduce job will load the information about the clusters calculated in the earlier recipe and use those clusters to assign the rest of the dataset to clusters.
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  AmazonCustomer amazonCustomer = new AmazonCustomer(
      value.toString().replaceAll("[0-9]+\\s+", ""));
  double minDistance = Double.MAX_VALUE;
  AmazonCustomer closestCluster = null;
  for (AmazonCustomer centroid : clusterCentroids) {
    double distance = amazonCustomer.getDistance(centroid);
    if (distance < minDistance) {
      minDistance = distance;
      closestCluster = centroid;
    }
  }
  amazonCustomer.clusterID = closestCluster.clusterID;
  context.write(new Text(closestCluster.clusterID),
      new Text(amazonCustomer.toString()));
}
The map task receives each line in the logfile, which contains information about an Amazon customer, as a separate key-value pair. It compares the customer against each of the cluster centroids and assigns the customer to the closest cluster. Finally, it emits the ID of that cluster as the key and the information about the customer as the value.
Then, Hadoop collects all values for the different keys (clusters) and invokes the reducer once for each cluster. Each reducer then emits the members of its cluster.
public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  for (Text value : values) {
    context.write(key, value);
  }
}
The main method of the job works similarly to those of the earlier recipes.
It is possible to improve the results by recalculating the cluster centroids in the reducer, splitting any clusters whose customers are too far apart from each other, and rerunning the GRGPF algorithm with the new clusters. More information about this can be found in Chapter 7, Clustering, of the book Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman. The book is available at http://infolab.stanford.edu/~ullman/mmds.html.
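The splitting refinement mentioned above can be sketched as follows. This is a hedged illustration, not the book's algorithm: the class name and threshold are hypothetical, and 1-D points with |a - b| as the distance stand in for customers under the co-purchase distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: split a cluster around its two farthest-apart
// members when that distance (the cluster "diameter") exceeds a threshold.
public class ClusterSplitter {

    // Returns the cluster unchanged, or two sub-clusters formed by
    // reassigning each member to the nearer of the two farthest-apart
    // points. The pair search is O(n^2), which is fine for one cluster.
    static List<List<Double>> maybeSplit(List<Double> cluster, double threshold) {
        double diameter = -1.0;
        double seedA = 0.0, seedB = 0.0;
        for (int i = 0; i < cluster.size(); i++) {
            for (int j = i + 1; j < cluster.size(); j++) {
                double d = Math.abs(cluster.get(i) - cluster.get(j));
                if (d > diameter) {
                    diameter = d;
                    seedA = cluster.get(i);
                    seedB = cluster.get(j);
                }
            }
        }
        if (diameter <= threshold) {
            return Collections.singletonList(cluster);
        }
        // Reassign every member to the nearer of the two seeds.
        List<Double> a = new ArrayList<>();
        List<Double> b = new ArrayList<>();
        for (double p : cluster) {
            if (Math.abs(p - seedA) <= Math.abs(p - seedB)) {
                a.add(p);
            } else {
                b.add(p);
            }
        }
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        // 0 and 1 sit far from 10 and 11, so a threshold of 5 splits them.
        List<List<Double>> parts =
            maybeSplit(Arrays.asList(0.0, 1.0, 10.0, 11.0), 5.0);
        System.out.println(parts);
    }
}
```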