Recommendations are suggestions made to someone about things that might be of interest to them. For example, we would recommend a good book to a friend who has similar interests. Recommendation use cases are common in online retail: when you browse for a product, Amazon suggests other products that were also bought by users who bought that item.
Online retail sites such as Amazon have very large collections of items. Although books are classified into several categories, each category often has too many items to browse one after the other. Recommendations make the user's life easier by helping them find the products that best match their tastes, and because good recommendations increase the chance of a sale, online retailers are very interested in recommendation algorithms.
There are many ways to make recommendations. This recipe uses a dataset about products collected from Amazon to make content-based recommendations. In the dataset, each product carries a pre-calculated list of similar items provided by Amazon, and we will use that data to make the recommendations.
The following steps describe how to prepare for this recipe. This recipe uses HADOOP_HOME to refer to the Hadoop installation directory.
The following steps describe how to run the content-based recommendation recipe.
1. Download the Amazon product metadata dataset (the amazon-meta.txt file) and unzip it. We will call the resulting directory DATA_DIR.
2. Upload the dataset to HDFS by running the following commands from HADOOP_HOME. If the /data directory already exists, clean it up:
$ bin/hadoop dfs -mkdir /data
$ bin/hadoop dfs -mkdir /data/input1
$ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/input1
3. Unzip the source code for this chapter (chapter8.zip). We will call that folder CHAPTER_8_SRC.
4. Change the hadoop.home property in the CHAPTER_8_SRC/build.xml file to point to your Hadoop installation directory.
5. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
6. Copy build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
7. Run the first MapReduce job through the following command from HADOOP_HOME:
$ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.MostFrequentUserFinder /data/input1 /data/output1
8. Read the results by running the following command:
$ bin/hadoop dfs -cat /data/output1/*
customerID=A1002VY75YRZYF,review=ASIN=0375812253#title=Really Useful Engines (Railway Series)#salesrank=623218#group=Book#rating=4#similar=0434804622|0434804614|0434804630|0679894780|0375827439|,review=ASIN=B000002BMD#title=Everything Must Go#salesrank=77939#group=Music#rating=4#similar=B00000J5ZX|B000024J5H|B00005AWNW|B000025KKX|B000008I2Z
9. Run the second MapReduce job through the following command from HADOOP_HOME:
$ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.ContentBasedRecommendation /data/output1 /data/output2
10. Read the results by running the following command:
$ bin/hadoop dfs -cat /data/output2/*
The results will be printed as follows. Each line of the result contains a customer ID and the list of product recommendations for that customer.
A10003PM9DTGHQ [0446611867, 0446613436, 0446608955, 0446606812, 0446691798, 0446611867, 0446613436, 0446608955, 0446606812, 0446691798]
The following listing shows an entry for one product from the dataset. Each data entry includes an ID, a title, a categorization, a list of items similar to this item, and information about the users who have bought the item.
Id: 13
ASIN: 0313230269
title: Clockwork Worlds : Mechanized Environments in SF (Contributions to the Study of Science Fiction and Fantasy)
group: Book
salesrank: 2895088
similar: 2 1559360968 1559361247
categories: 3
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|History & Criticism[10204]|Criticism & Theory[10207]|General[10213]
|Books[283155]|Subjects[1000]|Science Fiction & Fantasy[25]|Fantasy[16190]|History & Criticism[16203]
|Books[283155]|Subjects[1000]|Science Fiction & Fantasy[25]|Science Fiction[16272]|History & Criticism[16288]
reviews: total: 2 downloaded: 2 avg rating: 5
2002-8-5 customer: A14OJS0VWMOSWO rating: 5 votes: 2 helpful: 1
2003-3-21 customer: A2C27IQUH9N1Z rating: 5 votes: 4 helpful: 4
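To make the record layout concrete, the following standalone sketch pulls the ASIN and the similar-items list out of an entry in this format. It is only an illustration, and the class name here is made up; the recipe's real parsing logic lives in the source files discussed below.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: a minimal parser for the entry layout shown above.
public class EntrySketch {
    public static void main(String[] args) {
        String entry = "Id: 13\n"
                + "ASIN: 0313230269\n"
                + "similar: 2 1559360968 1559361247\n";
        String asin = null;
        List<String> similar = new ArrayList<String>();
        for (String line : entry.split("\n")) {
            line = line.trim();
            if (line.startsWith("ASIN:")) {
                asin = line.substring("ASIN:".length()).trim();
            } else if (line.startsWith("similar:")) {
                // The first token after "similar:" is a count;
                // the remaining tokens are the ASINs of similar items.
                String[] tokens =
                        line.substring("similar:".length()).trim().split("\\s+");
                for (int i = 1; i < tokens.length; i++) {
                    similar.add(tokens[i]);
                }
            }
        }
        // Prints: 0313230269 -> [1559360968, 1559361247]
        System.out.println(asin + " -> " + similar);
    }
}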
We have written a new Hadoop data format to read and parse the Amazon product data. The data format works similarly to the format we wrote in the Simple Analytics using MapReduce recipe in Chapter 6, Analytics. The source files src/chapter8/AmazonDataReader.java and src/chapter8/AmazonDataFormat.java contain the code for the Amazon data formatter.
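For readers unfamiliar with custom Hadoop input formats, the general shape of such a class is roughly the following. This is a sketch of the pattern, not the book's actual implementation; it assumes AmazonDataReader is a RecordReader<Text, Text> that returns one complete product entry per record.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: a custom input format delegates record boundaries to its reader.
public class AmazonDataFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // The reader scans the split and returns one whole product entry
        // (from its "Id:" line through its last review line) per record.
        return new AmazonDataReader();
    }
}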
The Amazon data formatter parses the dataset and emits the data about each Amazon product as a key-value pair to the map function. The data about each Amazon product is represented as a string, and the AmazonCustomer.java class includes the code to parse and write out the data about an Amazon customer.
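The fields the recipe's code relies on can be summarized in a rough outline such as the one below. This outline is an assumption inferred purely from how the fields are used in the code shown in this recipe; the real class also contains the parsing and serialization logic.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough outline of the data holders, inferred from their usage below.
public class AmazonCustomer {
    public String customerID;
    public Set<ItemData> itemsBrought = new HashSet<ItemData>();

    public static class ItemData {
        public String itemID;
        public List<String> similarItems = new ArrayList<String>();

        // equals()/hashCode() on itemID would be needed for the Set to
        // de-duplicate items merged from multiple map outputs.
        @Override
        public boolean equals(Object o) {
            return o instanceof ItemData && itemID.equals(((ItemData) o).itemID);
        }

        @Override
        public int hashCode() {
            return itemID.hashCode();
        }
    }
}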
This recipe includes two MapReduce jobs. The source for those jobs can be found in src/chapter8/MostFrequentUserFinder.java and src/chapter8/ContentBasedRecommendation.java.
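Submitting each job follows the standard Hadoop driver pattern. The following is a minimal sketch for the first job; the driver, mapper, and reducer class names here are placeholders for the classes containing the functions shown below, and the actual driver in the book's source may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MostFrequentUserFinderDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "most-frequent-user-finder");
        job.setJarByClass(MostFrequentUserFinderDriver.class);
        // Placeholder names for the mapper and reducer shown below.
        job.setMapperClass(MostFrequentUserFinderMapper.class);
        job.setReducerClass(MostFrequentUserFinderReducer.class);
        // Use the custom input format so each map call gets one product entry.
        job.setInputFormatClass(AmazonDataFormat.class);
        job.setMapOutputKeyClass(Text.class);      // customer ID
        job.setMapOutputValueClass(Text.class);    // serialized product data
        job.setOutputKeyClass(IntWritable.class);  // number of items bought
        job.setOutputValueClass(Text.class);       // serialized customer profile
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}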
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Each value holds the data for one product; parse out every
    // customer who has bought that product.
    List<AmazonCustomer> customerList =
            AmazonCustomer.parseAItemLine(value.toString());
    for (AmazonCustomer customer : customerList) {
        // Emit the customer ID as the key and the product
        // information as the value.
        context.write(new Text(customer.customerID),
                new Text(customer.toString()));
    }
}
The map task of the first MapReduce job receives the data about each product in the logfile as a separate key-value pair. When the map task receives the product data, it emits the customer ID as the key and the product information as the value, once for each customer who has bought the product.
Hadoop collects all the values emitted for a key and invokes the reducer once for each key; therefore, each reducer invocation receives all the products bought by a single customer. The reducer emits the list of items bought by each customer, thus building a customer profile. To limit the size of the dataset, the reducer does not emit any customer who has bought five or fewer products.
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Merge all the products bought by this customer into one profile.
    AmazonCustomer customer = new AmazonCustomer();
    customer.customerID = key.toString();
    for (Text value : values) {
        Set<ItemData> itemsBrought =
                new AmazonCustomer(value.toString()).itemsBrought;
        for (ItemData itemData : itemsBrought) {
            customer.itemsBrought.add(itemData);
        }
    }
    // Only emit customers who have bought more than five products,
    // to limit the size of the dataset.
    if (customer.itemsBrought.size() > 5) {
        context.write(new IntWritable(customer.itemsBrought.size()),
                new Text(customer.toString()));
    }
}
The second MapReduce job uses the data generated by the first job to make recommendations for each customer. The map task receives the data about one customer as its input and makes the recommendations using the following three steps:
1. Collect the lists of similar items for every item the customer has bought.
2. Remove from that list the items the customer has already bought.
3. Select the first ten remaining items as the recommendations.
Here, the reducer only prints out the results.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Strip the leading item count written by the first job's reducer
    // before parsing the customer profile.
    AmazonCustomer amazonCustomer = new AmazonCustomer(
            value.toString().replaceAll("[0-9]+\\s+", ""));
    // Step 1: collect the similar items of everything the customer bought.
    List<String> recommendations = new ArrayList<String>();
    for (ItemData itemData : amazonCustomer.itemsBrought) {
        recommendations.addAll(itemData.similarItems);
    }
    // Step 2: remove the items the customer has already bought.
    for (ItemData itemData : amazonCustomer.itemsBrought) {
        recommendations.remove(itemData.itemID);
    }
    // Step 3: keep at most the first ten items as the recommendations.
    List<String> finalRecommendations = new ArrayList<String>();
    for (int i = 0; i < Math.min(10, recommendations.size()); i++) {
        finalRecommendations.add(recommendations.get(i));
    }
    context.write(new Text(amazonCustomer.customerID),
            new Text(finalRecommendations.toString()));
}

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // The reducer simply passes the recommendations through.
    for (Text value : values) {
        context.write(key, value);
    }
}
You can learn more about content-based recommendations from Chapter 9, Recommendation Systems, of Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman. The book can be found at http://infolab.stanford.edu/~ullman/mmds.html.