Clustering an Amazon sales dataset

As explained in the earlier recipe, although very accurate, hierarchical clustering has a time complexity of O(n²log(n)) and is therefore not applicable to large datasets. For example, the Amazon dataset we will use has about 0.2 million entries, and clustering it hierarchically would need about 10 tera calculations (about 10¹³).

This recipe describes how to apply the GRGPF clustering algorithm to the Amazon sales dataset. We can find the distance between any two data points (products) using the items they have been co-purchased with, but we do not know how to place those data points in a multi-dimensional space (such as a 2D or 3D space) with orthogonal axes. Therefore, we say that the data is in a non-Euclidean space.
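For instance, one simple way to define such a pairwise distance is the Jaccard distance between the sets of co-purchased item IDs of two products. The following sketch only illustrates that idea with plain Java sets; it is not necessarily the distance function used by the recipe's AmazonCustomer.getDistance() method.

import java.util.HashSet;
import java.util.Set;

public class CoPurchaseDistance
{
  // Jaccard distance: 1 - |A intersect B| / |A union B| over the co-purchased
  // item sets. Only pairwise distances are defined; there are no coordinates,
  // which is why the space is non-Euclidean.
  public static double distance(Set<String> coPurchasedA, Set<String> coPurchasedB)
  {
    Set<String> intersection = new HashSet<String>(coPurchasedA);
    intersection.retainAll(coPurchasedB);
    Set<String> union = new HashSet<String>(coPurchasedA);
    union.addAll(coPurchasedB);
    if (union.isEmpty())
    {
      return 1.0; // no co-purchase information at all; treat as maximally far apart
    }
    return 1.0 - ((double) intersection.size()) / union.size();
  }
}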

The GRGPF clustering algorithm is a highly scalable algorithm applicable to non-Euclidean spaces. It works by first taking a small random sample of the data and clustering it using an algorithm such as hierarchical clustering, and then using those clusters to cluster the large dataset without keeping all the data in memory.
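The following small, self-contained sketch illustrates this sample-then-assign idea in plain Java. The DataPoint type, the one-dimensional toy distance, and the trivial "each sample point becomes a representative" step are simplifications made for illustration; the recipe itself works with AmazonCustomer objects and clusters the sample hierarchically.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SampleThenAssign
{
  static class DataPoint
  {
    double value;       // stand-in for a real, non-Euclidean record
    String clusterID;
    DataPoint(double value) { this.value = value; }
    double distance(DataPoint other) { return Math.abs(value - other.value); }
  }

  public static void main(String[] args)
  {
    List<DataPoint> all = new ArrayList<DataPoint>();
    for (int i = 0; i < 1000; i++)
    {
      all.add(new DataPoint(Math.random()));
    }

    // 1. Take a small random sample that fits in memory.
    List<DataPoint> sample = new ArrayList<DataPoint>(all);
    Collections.shuffle(sample);
    sample = sample.subList(0, 10);

    // 2. Cluster the sample in memory; here each sample point simply becomes a
    //    cluster representative (the recipe clusters the sample hierarchically).
    List<DataPoint> representatives = new ArrayList<DataPoint>();
    for (int i = 0; i < sample.size(); i++)
    {
      DataPoint representative = sample.get(i);
      representative.clusterID = "cluster-" + i;
      representatives.add(representative);
    }

    // 3. Stream over the full dataset and assign every point to the closest
    //    representative, so the full dataset never has to be clustered in memory.
    for (DataPoint point : all)
    {
      DataPoint closest = representatives.get(0);
      for (DataPoint representative : representatives)
      {
        if (point.distance(representative) < point.distance(closest))
        {
          closest = representative;
        }
      }
      point.clusterID = closest.clusterID;
    }
    System.out.println("First point assigned to " + all.get(0).clusterID);
  }
}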

Getting ready

The following steps describe how to prepare to cluster the Amazon dataset.

  1. This assumes that you have followed Chapter 1, Getting Hadoop up and running in a Cluster, and have installed Hadoop. We will use HADOOP_HOME to refer to the Hadoop installation directory.
  2. Start Hadoop following the instructions in Chapter 1, Getting Hadoop up and running in a Cluster.
  3. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe in Chapter 1, Getting Hadoop up and running in a Cluster.
  4. This assumes that you have followed the Content-based recommendations recipe of this chapter and have extracted the Amazon product data.

How to do it...

The following steps describe how to cluster the Amazon dataset.

  1. Unzip the source code for Chapter 8 (chapter8.zip). We will call that folder CHAPTER_8_SRC.
  2. Change the hadoop.home property in the build.xml file under CHAPTER_8_SRC to point to your Hadoop installation directory.
  3. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
  4. Copy build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
  5. Run the MapReduce job using the following command from HADOOP_HOME. It will use the output generated by the first MapReduce task of the first recipe.
    $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.GRGPFClustering /data/output1 /data/output3
    
  6. Read the results by running the following command.
    $ bin/hadoop dfs -cat /data/output3/*
    

You will see that it prints the results as follows. Here, the key indicates the cluster ID and the rest of the result shows the details of the customers assigned to that cluster.

customerID=A3S764AIU7PCLT,clusterID=0,
review=ASIN=0140296468#title=The New New Thing: A Silicon Valley Story ..

How it works...

This recipe uses a single MapReduce task to assign customers to clusters. You can find the source for the recipe in src/chapter8/GRGPFClustering.java.

When initialized, the MapReduce job will load the information about the clusters calculated in the earlier recipe and use those clusters to assign the rest of the dataset to clusters.
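A possible shape of that initialization step is sketched below as a Mapper setup() method. The configuration property name (clustering.centroid.file) and the line-per-centroid file format are assumptions made for illustration, and the snippet relies on the usual java.io and org.apache.hadoop.fs imports; the actual GRGPFClustering.java may load the centroids differently.

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
  Configuration conf = context.getConfiguration();
  // Location of the cluster centroids computed by the earlier recipe
  // (hypothetical property name, set by the driver).
  Path centroidPath = new Path(conf.get("clustering.centroid.file"));
  FileSystem fs = FileSystem.get(conf);
  BufferedReader reader =
    new BufferedReader(new InputStreamReader(fs.open(centroidPath)));
  try
  {
    String line;
    while ((line = reader.readLine()) != null)
    {
      // Each line holds one centroid produced by the earlier clustering step.
      clusterCentroids.add(new AmazonCustomer(line));
    }
  }
  finally
  {
    reader.close();
  }
}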

public void map(Object key, Text value, Context context)
  throws IOException, InterruptedException
{
  // Parse the customer record, stripping the leading numeric key from the line
  AmazonCustomer amazonCustomer =
    new AmazonCustomer(value.toString().replaceAll("[0-9]+\\s+", ""));
  double minDistance = Double.MAX_VALUE;
  AmazonCustomer closestCluster = null;
  // Compare the customer against each cluster centroid and remember the closest one
  for (AmazonCustomer centroid : clusterCentroids)
  {
    double distance = amazonCustomer.getDistance(centroid);
    if (distance < minDistance)
    {
      minDistance = distance;
      closestCluster = centroid;
    }
  }
  // Assign the customer to the closest cluster and emit clusterID -> customer
  amazonCustomer.clusterID = closestCluster.clusterID;
  context.write(new Text(closestCluster.clusterID),
    new Text(amazonCustomer.toString()));
}

The map task receives each line of the input file, which contains information about one Amazon customer, as a separate key-value pair. It compares that customer against each of the cluster centroids and assigns the customer to the closest cluster. Finally, it emits the ID of that cluster as the key and the information about the customer as the value.

Hadoop then collects all the values for the different keys (clusters) and invokes the reducer once for each cluster. Each reducer emits the members of its cluster.

public void reduce(Text key, Iterable<Text> values,
  Context context) throws IOException, InterruptedException
{
  // The key is the cluster ID; emit every customer assigned to it unchanged
  for (Text value : values)
  {
    context.write(key, value);
  }
}

The main method of the job works similarly to those of the earlier recipes.
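A minimal sketch of what such a main() method typically looks like is shown below. The mapper and reducer class names (AMapper, AReducer) and the absence of any extra configuration are assumptions for illustration, not necessarily what GRGPFClustering.java contains.

public static void main(String[] args) throws Exception
{
  Configuration conf = new Configuration();
  Job job = new Job(conf, "GRGPFClustering");
  job.setJarByClass(GRGPFClustering.class);
  job.setMapperClass(AMapper.class);    // the map() shown above
  job.setReducerClass(AReducer.class);  // the reduce() shown above
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));    // for example, /data/output1
  FileOutputFormat.setOutputPath(job, new Path(args[1]));  // for example, /data/output3
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}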

There's more...

It is possible to improve the results by recalculating the cluster centroids in the reducer, splitting any clusters whose customers are too far apart from each other, and rerunning the GRGPF algorithm with the new clusters. More information about this can be found in Chapter 7, Clustering, of the book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman, which is available at http://infolab.stanford.edu/~ullman/mmds.html.
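For example, one way to recalculate a cluster's representative inside the reducer is to pick the member (the "clustroid") with the smallest total distance to all the other members, since an averaged centroid has no meaning in a non-Euclidean space. The following reducer sketch illustrates that idea; it assumes a single cluster fits in memory and reuses the AmazonCustomer class and its getDistance() method from the recipe, but it is not code from the recipe itself.

public void reduce(Text key, Iterable<Text> values,
  Context context) throws IOException, InterruptedException
{
  // Materialize the members of this cluster (assumes one cluster fits in memory)
  List<AmazonCustomer> members = new ArrayList<AmazonCustomer>();
  for (Text value : values)
  {
    members.add(new AmazonCustomer(value.toString()));
  }

  // Pick the member with the smallest total distance to all other members
  AmazonCustomer clustroid = null;
  double bestTotal = Double.MAX_VALUE;
  for (AmazonCustomer candidate : members)
  {
    double total = 0;
    for (AmazonCustomer other : members)
    {
      total += candidate.getDistance(other);
    }
    if (total < bestTotal)
    {
      bestTotal = total;
      clustroid = candidate;
    }
  }

  // Emit the recomputed representative; a follow-up run can use it as the new
  // centroid, or split the cluster if its members are too far apart
  context.write(key, new Text(clustroid.toString()));
}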
