This recipe uses collaborative filtering to make recommendations for customers in the Amazon dataset. As described in the introduction, collaborative filtering uses the sales activities that a given user has in common with other users to deduce the best product recommendations for that user.
To implement collaborative filtering, we will cluster the users based on their behavior, and use the items bought by members of a cluster to find recommendations for each member of that cluster. We will use the clusters calculated in the earlier recipe.
The following steps show how to prepare and run the collaborative filtering example:

1. Set the HADOOP_HOME environment variable to refer to the Hadoop installation directory.
2. Run the following command to execute the job:
   $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.ClusterBasedRecommendation /data/output3 /data/output4
3. Read the results by running the following command:
   $ bin/hadoop dfs -cat /data/output4/*
The command will print the results as follows. Here, the key is the customer ID and the value is the list of recommendations for that customer.
A1JDFNG3KI9D1V [ASIN=6300215539#title=The War of the Worlds#salesrank=1#group=Video#rating=4#, ..]
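Each item in the output value packs its fields into a single `#`-separated string, as the sample above shows. If you need to post-process the results, one way to pull the fields apart is sketched below; the `ItemFieldParser` class and its `parseItem` helper are our own illustration (the field names are taken from the sample output, not from a parser defined in the recipe):

```java
import java.util.HashMap;
import java.util.Map;

public class ItemFieldParser {
    // Parses one "ASIN=...#title=...#salesrank=...#group=...#rating=..."
    // entry into a field-name -> value map.
    static Map<String, String> parseItem(String entry) {
        Map<String, String> fields = new HashMap<String, String>();
        for (String part : entry.split("#")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f = parseItem(
            "ASIN=6300215539#title=The War of the Worlds#salesrank=1#group=Video#rating=4");
        System.out.println(f.get("title"));     // The War of the Worlds
        System.out.println(f.get("salesrank")); // 1
    }
}
```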
Collaborative filtering uses the behavior of users to decide on the best recommendations for each user. For that, the recipe uses the following steps:
To group the customers, we can use the clustering techniques from the earlier recipes. As the measure of the distance between customers, we use the distance measure introduced in the second recipe of this chapter, which uses customer co-purchase information to decide on the similarity between customers.
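The exact distance function is defined in that earlier recipe. To illustrate the general idea, the sketch below assumes a Jaccard-style distance over the sets of item IDs two customers have bought; the `CoPurchaseDistance` class and its `distance` method are our own example, not code from the recipe:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CoPurchaseDistance {
    // Jaccard-style distance between two customers' purchase sets:
    // 1 - |intersection| / |union|.
    // 0.0 means identical purchase sets; 1.0 means no items in common.
    static double distance(Set<String> itemsA, Set<String> itemsB) {
        if (itemsA.isEmpty() && itemsB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<String>(itemsA);
        intersection.retainAll(itemsB);
        Set<String> union = new HashSet<String>(itemsA);
        union.addAll(itemsB);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<String>(Arrays.asList("item1", "item2", "item3"));
        Set<String> b = new HashSet<String>(Arrays.asList("item2", "item3", "item4"));
        // 2 shared items out of 4 distinct items -> distance 0.5
        System.out.println(distance(a, b)); // 0.5
    }
}
```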
We have already clustered the customers into different groups in the earlier recipe, and we will use those results to make recommendations.
You can find the source for the recipe at src/chapter8/ClusterBasedRecommendation.java. The map task for the job looks like the following:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  // Strip the leading numeric prefix and parse the customer record
  AmazonCustomer amazonCustomer = new AmazonCustomer(
      value.toString().replaceAll("[0-9]+\\s+", ""));
  // Emit the cluster ID as the key so that customers in the same
  // cluster are grouped together at the reducer
  context.write(new Text(amazonCustomer.clusterID),
      new Text(amazonCustomer.toString()));
}
The map task receives each line in the logfile as a separate key-value pair. It parses the line using a regular expression and emits the cluster ID as the key and the customer information as the value.
Hadoop will group the customer records emitted against the same cluster ID and call the reducer once for each cluster ID. The reducer then walks through the customers assigned to that cluster and creates a list of items, sorted by Amazon sales rank, as potential recommendations. Finally, it makes the recommendations for each customer by removing any items that the customer has already bought.
public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  List<AmazonCustomer> customerList = new ArrayList<AmazonCustomer>();
  // Bounded sorted set holding the 1000 best-ranked items seen so far
  TreeSet<AmazonCustomer.SortableItemData> highestRated1000Items =
      new TreeSet<AmazonCustomer.SortableItemData>();
  for (Text value : values) {
    AmazonCustomer customer = new AmazonCustomer(value.toString());
    for (ItemData itemData : customer.itemsBrought) {
      highestRated1000Items.add(customer.new SortableItemData(itemData));
      // Evict the worst-ranked item whenever the set overflows
      if (highestRated1000Items.size() > 1000) {
        highestRated1000Items.remove(highestRated1000Items.last());
      }
    }
    customerList.add(customer);
  }
  for (AmazonCustomer amazonCustomer : customerList) {
    // Candidate recommendations: top-ranked cluster items the
    // customer has not already bought
    List<ItemData> recommendationList = new ArrayList<AmazonCustomer.ItemData>();
    for (SortableItemData sortableItemData : highestRated1000Items) {
      if (!amazonCustomer.itemsBrought.contains(sortableItemData.itemData)) {
        recommendationList.add(sortableItemData.itemData);
      }
    }
    // Keep at most the 10 best candidates as the final recommendations
    ArrayList<ItemData> finalRecommendations = new ArrayList<ItemData>();
    for (int i = 0; i < 10 && i < recommendationList.size(); i++) {
      finalRecommendations.add(recommendationList.get(i));
    }
    context.write(new Text(amazonCustomer.customerID),
        new Text(finalRecommendations.toString()));
  }
}
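The bounded TreeSet in the reducer keeps only the 1000 best-ranked items seen so far by evicting the worst element whenever the set grows past the limit. The same pattern is shown in isolation below, a minimal sketch in plain Java using integer sales ranks (lower is better) in place of the recipe's SortableItemData:

```java
import java.util.TreeSet;

public class TopNRanks {
    // Keeps at most n smallest sales ranks. Whenever the set overflows,
    // the largest (worst) rank is evicted. Duplicate ranks collapse,
    // since a TreeSet holds each value only once.
    static TreeSet<Integer> topN(int[] ranks, int n) {
        TreeSet<Integer> best = new TreeSet<Integer>();
        for (int rank : ranks) {
            best.add(rank);
            if (best.size() > n) {
                best.remove(best.last());
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] ranks = {42, 7, 100, 3, 56, 19};
        System.out.println(topN(ranks, 3)); // [3, 7, 19]
    }
}
```

Because the set never holds more than n + 1 elements, this keeps memory bounded no matter how many items the cluster's customers have bought.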
The main() method of the job works similarly to those in the earlier recipes.