Collaborative filtering-based recommendations

This recipe will use collaborative filtering to make recommendations for customers in the Amazon dataset. As described in the introduction, collaborative filtering uses the sales activity that a given user has in common with other users to deduce the best product recommendations for that user.

To implement collaborative filtering, we will cluster the users based on their behavior, and use the items bought by members of a cluster to find recommendations for each member of that cluster. We will use the clusters calculated in the earlier recipe.

Getting ready

The following steps show how to prepare to run the collaborative filtering example:

  1. This assumes that you have followed Chapter 1, Getting Hadoop up and running in a Cluster, and have installed Hadoop. We will use HADOOP_HOME to refer to the Hadoop installation directory.
  2. Start Hadoop by following the instructions in the first chapter.
  3. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a WordCount MapReduce Sample, Bundling it and running it using standalone Hadoop recipe from the first chapter.
  4. This will use the results from the previous recipe of this chapter. Follow it if you have not done so already.

How to do it...

The following steps show how to run the collaborative filtering example:

  1. Run the MapReduce job using the following command from HADOOP_HOME:
    $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.ClusterBasedRecommendation /data/output3 /data/output4
    
  2. Read the results by running the following command:
    $ bin/hadoop dfs -cat /data/output4/*
    

You will see that it prints the results as follows. Here, the key is the customer ID and the value is the list of recommendations for that customer:

A1JDFNG3KI9D1V  [ASIN=6300215539#title=The War of the Worlds#salesrank=1#group=Video#rating=4#, ..]

How it works...

Collaborative filtering uses the behavior of the users to decide on the best recommendations for each user. For that process, the recipe will use the following steps:

  1. Group the customers into several groups such that similar customers are in the same group and dissimilar customers are in different groups.
  2. To make recommendations for each customer, look at the other members of the same group and use the items bought by those members, assuming that similar users like to buy similar products.
  3. When there are many candidate recommendations, use the Amazon sales rank to select the best ones.
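The sales-rank selection in step 3 can be sketched in isolation as a bounded sorted set that keeps only the best-ranked items. The following is a minimal standalone sketch, assuming a lower sales rank means a better seller; the class name and the limit parameter are illustrative, not part of the recipe's source:

```java
import java.util.TreeSet;

public class TopRankedItems {
    // Keeps at most 'limit' sales ranks; a lower rank means a better seller.
    public static TreeSet<Integer> bestRanks(int[] salesRanks, int limit) {
        TreeSet<Integer> best = new TreeSet<Integer>();
        for (int rank : salesRanks) {
            best.add(rank);
            if (best.size() > limit) {
                // Drop the worst-ranked (largest) entry to stay within the limit
                best.remove(best.last());
            }
        }
        return best;
    }
}
```

The reducer shown later in this recipe applies the same add-then-trim pattern to a TreeSet of items, with a limit of 1000.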

For grouping the customers, we can use the clustering techniques used in the earlier recipes. As a measure of the distance between customers, we have used the distance measure introduced in the second recipe of this chapter, which uses customer co-purchase information to decide on the similarity between customers.
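As an illustration of a co-purchase-based distance (the actual measure is the one defined in the second recipe of this chapter), the following sketch computes the Jaccard distance between two customers' purchased item sets; the class and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

public class CoPurchaseDistance {
    // Jaccard distance: 1 - |A intersect B| / |A union B|.
    // 0 means identical purchase histories, 1 means no items in common.
    public static double distance(Set<String> itemsA, Set<String> itemsB) {
        if (itemsA.isEmpty() && itemsB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<String>(itemsA);
        intersection.retainAll(itemsB);
        Set<String> union = new HashSet<String>(itemsA);
        union.addAll(itemsB);
        return 1.0 - (double) intersection.size() / union.size();
    }
}
```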

We have already clustered the customers into different groups in the earlier recipe. We will use those results to make recommendations.

You can find the source for this recipe in src/chapter8/ClusterBasedRecommendation.java. The map task for the job looks like the following:

public void map(Object key, Text value, Context context)
  throws IOException, InterruptedException
{
  AmazonCustomer amazonCustomer =
    new AmazonCustomer(value.toString().replaceAll("[0-9]+\\s+", ""));
  context.write(new Text(amazonCustomer.clusterID),
    new Text(amazonCustomer.toString()));
}

The map task receives each line of the logfile as a separate key-value pair. It parses the line using regular expressions and emits the cluster ID as the key and the customer information as the value.
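The stripping step in the map task can be seen in isolation below. The input line here is hypothetical, but the pattern matches the replaceAll() call above: it removes any run of digits followed by whitespace, such as a numeric key prefix at the start of the line:

```java
public class StripNumericPrefix {
    // Remove digit runs followed by whitespace, e.g. a leading record number.
    public static String strip(String line) {
        return line.replaceAll("[0-9]+\\s+", "");
    }
}
```

Note that digits embedded inside the customer data itself (such as those within a customer ID) are untouched, because they are not followed by whitespace.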

Hadoop groups the customer records emitted against the same cluster ID and calls the reducer once for each cluster ID. The reducer then walks through the customers assigned to that cluster and builds a list of items as potential recommendations, sorted by Amazon sales rank. Finally, it makes the recommendations for each user by removing any items that the user has already bought.

public void reduce(Text key, Iterable<Text> values, Context context)
  throws IOException, InterruptedException
{
  List<AmazonCustomer> customerList =
    new ArrayList<AmazonCustomer>();
  TreeSet<AmazonCustomer.SortableItemData> highestRated1000Items =
    new TreeSet<AmazonCustomer.SortableItemData>();
  for (Text value : values)
  {
    AmazonCustomer customer =
      new AmazonCustomer(value.toString());
    for (ItemData itemData : customer.itemsBrought)
    {
      highestRated1000Items.add(
        customer.new SortableItemData(itemData));
      if (highestRated1000Items.size() > 1000)
      {
        highestRated1000Items.remove(highestRated1000Items.last());
      }
    }
    customerList.add(customer);
  }

  for (AmazonCustomer amazonCustomer : customerList)
  {
    List<ItemData> recommendationList =
      new ArrayList<AmazonCustomer.ItemData>();
    for (SortableItemData sortableItemData : highestRated1000Items)
    {
      if (!amazonCustomer.itemsBrought.contains(sortableItemData.itemData))
      {
        recommendationList.add(sortableItemData.itemData);
      }
    }
    ArrayList<ItemData> finalRecommendations = new ArrayList<ItemData>();
    // Guard against clusters that yield fewer than 10 candidate items
    for (int i = 0; i < Math.min(10, recommendationList.size()); i++)
    {
      finalRecommendations.add(recommendationList.get(i));
    }

    context.write(new Text(amazonCustomer.customerID),
      new Text(finalRecommendations.toString()));
  }
}

The main method of the job works similarly to those of the earlier recipes.
