Content-based recommendations

Recommendations suggest items that might be of interest to a user. For example, we would recommend a good book to a friend who has similar interests. Use cases for recommendations are common in online retail: when you browse for a product, Amazon suggests other products that were also bought by users who bought that item.

Online retail sites such as Amazon carry a very large collection of items. Although books are classified into several categories, each category often has too many titles to browse one after the other. Recommendations make the user's life easier by helping them find the products that best match their tastes. Because a good recommendation improves the chance of a sale, online retailers are very interested in recommendation algorithms.

There are many ways to do recommendations:

  • Content-based recommendations: One could use information about the product itself to identify similar products. For example, you could use categories, content similarities, and so on to identify products that are similar and recommend them to users who have already bought one of them (a small similarity sketch follows this list).
  • Collaborative filtering: The other option is to use users' behavior to identify similarities between products. For example, if the same user gave a high rating to two products, we can argue that there is some similarity between those products. We will look at an instance of this in the next recipe.
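
Content-based similarity can be computed in many ways. As a minimal sketch (not part of this recipe's source code), the following hypothetical helper scores two products by the Jaccard similarity of their category sets, that is, the size of the intersection divided by the size of the union; a score close to 1 means the products share most of their categories:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CategorySimilarity {

  // Jaccard similarity of two category sets:
  // size of the intersection divided by the size of the union.
  public static double jaccard(Set<String> a, Set<String> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    Set<String> intersection = new HashSet<String>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    Set<String> bookA = new HashSet<String>(
        Arrays.asList("Science Fiction", "History & Criticism"));
    Set<String> bookB = new HashSet<String>(
        Arrays.asList("Science Fiction", "Fantasy"));
    // One shared category out of three distinct ones: prints 0.3333...
    System.out.println(jaccard(bookA, bookB));
  }
}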

This recipe uses a dataset about products collected from Amazon to make content-based recommendations. In the dataset, each product has a list of similar items pre-calculated by Amazon. We will use that data to make recommendations.

Getting ready

The following steps show how to prepare for this recipe.

  1. This recipe assumes that you have followed Chapter 1, Getting Hadoop up and running in a Cluster, and have installed Hadoop. We will use HADOOP_HOME to refer to the Hadoop installation directory.
  2. Start Hadoop by following the instructions in Chapter 1, Getting Hadoop up and running in a Cluster.
  3. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe in Chapter 1, Getting Hadoop up and running in a Cluster.

How to do it...

The following steps describe how to run the content-based recommendation recipe.

  1. Download the Amazon product co-purchasing network metadata dataset from http://snap.stanford.edu/data/amazon-meta.html and unzip it. We will refer to this directory as DATA_DIR.
  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If the /data directory already exists, clean it up:
    $ bin/hadoop dfs -mkdir /data
    $ bin/hadoop dfs -mkdir /data/input1
    $ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/input1
    
  3. Unzip the source code for Chapter 8 (chapter8.zip). We will refer to that folder as CHAPTER_8_SRC.
  4. Change the hadoop.home property in the CHAPTER_8_SRC/build.xml file to point to your Hadoop installation directory.
  5. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
  6. Copy build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
  7. Run the first MapReduce job by executing the following command from HADOOP_HOME:
    $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.MostFrequentUserFinder /data/input1 /data/output1
    
  8. Read the results by running the following command:
    $ bin/hadoop dfs -cat /data/output1/*
    
  9. You will see that the MapReduce job has extracted the purchase data for each customer, and the results will look like the following:
    customerID=A1002VY75YRZYF,review=ASIN=0375812253#title=Really Useful Engines (Railway Series)#salesrank=623218#group=Book#rating=4#similar=0434804622|0434804614|0434804630|0679894780|0375827439|,review=ASIN=B000002BMD#title=Everything Must Go#salesrank=77939#group=Music#rating=4#similar=B00000J5ZX|B000024J5H|B00005AWNW|B000025KKX|B000008I2Z
  10. Run the second MapReduce job by executing the following command from HADOOP_HOME:
    $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.ContentBasedRecommendation /data/output1 /data/output2
    
  11. Read the results by running the following command:
    $ bin/hadoop dfs -cat /data/output2/*
    

You will see results like the following. Each line contains a customer ID followed by the list of product recommendations for that customer.

A10003PM9DTGHQ  [0446611867, 0446613436, 0446608955, 0446606812, 0446691798, 0446611867, 0446613436, 0446608955, 0446606812, 0446691798]
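
Note that the recommendation list in the preceding sample contains duplicates: the similar-item lists of different purchases can overlap, and the map task shown later does not deduplicate its candidate list. As a small, hypothetical variation (not part of the chapter's source code), the candidates could be passed through a LinkedHashSet, which drops repeats while preserving the order of first occurrence:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupeRecommendations {

  // Drops repeated ASINs while preserving the order of first occurrence.
  public static List<String> dedupe(List<String> candidates) {
    return new ArrayList<String>(new LinkedHashSet<String>(candidates));
  }

  public static void main(String[] args) {
    List<String> withDuplicates = Arrays.asList(
        "0446611867", "0446613436", "0446611867", "0446608955");
    // Prints [0446611867, 0446613436, 0446608955]
    System.out.println(dedupe(withDuplicates));
  }
}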

How it works...

The following listing shows an entry for one product from the dataset. Here, each data entry includes an ID, a title, a categorization, items similar to this item, and information about the users who have bought the item.

Id:   13
ASIN: 0313230269
title: Clockwork Worlds : Mechanized Environments in SF (Contributions to the Study of Science Fiction and Fantasy)
group: Book
salesrank: 2895088
similar: 2  1559360968  1559361247
categories: 3
   |Books[283155]|Subjects[1000]|Literature & Fiction[17]|History & Criticism[10204]|Criticism & Theory[10207]|General[10213]
   |Books[283155]|Subjects[1000]|Science Fiction & Fantasy[25]|Fantasy[16190]|History & Criticism[16203]
   |Books[283155]|Subjects[1000]|Science Fiction & Fantasy[25]|Science Fiction[16272]|History & Criticism[16288]
reviews: total: 2  downloaded: 2  avg rating: 5
    2002-8-5  customer: A14OJS0VWMOSWO  rating: 5  votes:   2  helpful:   1
    2003-3-21  customer:  A2C27IQUH9N1Z  rating: 5  votes:   4  helpful:   4

We have written a new Hadoop data format to read and parse the Amazon product data; it works similarly to the format we wrote in the Simple Analytics using MapReduce recipe in Chapter 6, Analytics. The source files src/chapter8/AmazonDataReader.java and src/chapter8/AmazonDataFormat.java contain the code for the Amazon data formatter.

The Amazon data formatter parses the dataset and emits the data about each Amazon product as a key-value pair to the map function. The data about each Amazon product is represented as a string, and the AmazonCustomer class (src/chapter8/AmazonCustomer.java) includes the code to parse and write out the data about an Amazon customer.
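
The source of AmazonDataReader is not reproduced in this recipe. As a rough sketch of the work such a reader has to do (the class and field names below are assumptions for illustration, not the chapter's actual code), the following snippet extracts the ASIN, title, and similar-item list from one multi-line product entry of the format shown above:

import java.util.ArrayList;
import java.util.List;

public class ProductEntryParser {

  public String asin;
  public String title;
  public List<String> similarItems = new ArrayList<String>();

  // Parses one multi-line product entry of the dataset format shown above.
  public static ProductEntryParser parse(String entry) {
    ProductEntryParser product = new ProductEntryParser();
    for (String line : entry.split("\n")) {
      line = line.trim();
      if (line.startsWith("ASIN:")) {
        product.asin = line.substring("ASIN:".length()).trim();
      } else if (line.startsWith("title:")) {
        product.title = line.substring("title:".length()).trim();
      } else if (line.startsWith("similar:")) {
        // The format is "similar: <count> <ASIN> <ASIN> ..."; skip the count.
        String[] tokens = line.substring("similar:".length()).trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
          product.similarItems.add(tokens[i]);
        }
      }
    }
    return product;
  }
}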

This recipe includes two MapReduce jobs. The source for those jobs can be found in src/chapter8/MostFrequentUserFinder.java and src/chapter8/ContentBasedRecommendation.java. The map function of the first job is as follows:

public void map(Object key, Text value, Context context)
  throws IOException, InterruptedException
{
  List<AmazonCustomer> customerList =
    AmazonCustomer.parseAItemLine(value.toString());
  for (AmazonCustomer customer : customerList)
  {
    context.write(new Text(customer.customerID),
      new Text(customer.toString()));
  }
}

The map task of the first MapReduce job receives the data about each product in the data file as a separate key-value pair. When the map task receives the product data, it emits the customer ID as the key and the product information as the value, once for each customer who has bought the product.

Hadoop collects all the values for each key and invokes the reducer once per key. Each reducer invocation receives all the products that have been bought by one customer. The reducer emits the list of items bought by that customer, thus building a customer profile. To limit the size of the dataset, the reducer does not emit any customer who has bought five or fewer products.

public void reduce(Text key, Iterable<Text> values, Context context)
  throws IOException, InterruptedException
{
  AmazonCustomer customer = new AmazonCustomer();
  customer.customerID = key.toString();
  for (Text value : values)
  {
    Set<ItemData> itemsBrought = new AmazonCustomer(
      value.toString()).itemsBrought;
    for (ItemData itemData : itemsBrought)
    {
      customer.itemsBrought.add(itemData);
    }
  }
  if (customer.itemsBrought.size() > 5)
  {
    context.write(new IntWritable(customer.itemsBrought.size()),
      new Text(customer.toString()));
  }
}

The second MapReduce job uses the data generated by the first job to make recommendations for each customer. The map task receives the data about each customer as its input, and makes recommendations using the following three steps:

  1. Each product (item) entry from Amazon includes a list of items similar to that item. Given a customer, the map task first creates a list of all the similar items for each item that the customer has bought.
  2. Then, the map task removes from that list any item that has already been bought by the same customer.
  3. Finally, the map task selects up to ten items as recommendations.

    Here, the reducer simply writes out the results.

    public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException
    {
      AmazonCustomer amazonCustomer = new AmazonCustomer(
        value.toString().replaceAll("[0-9]+\\s+", ""));
      List<String> recommendations = new ArrayList<String>();
      for (ItemData itemData : amazonCustomer.itemsBrought)
      {
        recommendations.addAll(itemData.similarItems);
      }
      for (ItemData itemData : amazonCustomer.itemsBrought)
      {
        recommendations.remove(itemData.itemID);
      }
      ArrayList<String> finalRecommendations = new ArrayList<String>();
      for (int i = 0; i < Math.min(10, recommendations.size()); i++)
      {
        finalRecommendations.add(recommendations.get(i));
      }
      context.write(new Text(amazonCustomer.customerID),
        new Text(finalRecommendations.toString()));
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException
    {
      for (Text value : values)
      {
        context.write(key, value);
      }
    }
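
The driver that configures and submits these jobs is not shown in the recipe. As a minimal sketch of how the two jobs could be chained (the class names follow the recipe, but the driver details, such as the job names, are assumptions), the first job's output directory simply becomes the second job's input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecommendationDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Job 1: build a per-customer purchase profile from the raw dataset.
    Job profileJob = new Job(conf, "most-frequent-user-finder");
    profileJob.setJarByClass(RecommendationDriver.class);
    // The mapper, reducer, and custom input format from
    // MostFrequentUserFinder would be set here, for example:
    // profileJob.setMapperClass(...); profileJob.setReducerClass(...);
    profileJob.setMapOutputKeyClass(Text.class);
    profileJob.setMapOutputValueClass(Text.class);
    profileJob.setOutputKeyClass(IntWritable.class);
    profileJob.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(profileJob, new Path("/data/input1"));
    FileOutputFormat.setOutputPath(profileJob, new Path("/data/output1"));
    if (!profileJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // Job 2: read the customer profiles and emit the recommendations.
    Job recommendJob = new Job(conf, "content-based-recommendation");
    recommendJob.setJarByClass(RecommendationDriver.class);
    // recommendJob.setMapperClass(...); recommendJob.setReducerClass(...);
    recommendJob.setOutputKeyClass(Text.class);
    recommendJob.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(recommendJob, new Path("/data/output1"));
    FileOutputFormat.setOutputPath(recommendJob, new Path("/data/output2"));
    System.exit(recommendJob.waitForCompletion(true) ? 0 : 1);
  }
}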

There's more...

You can learn more about content-based recommendations from Chapter 9, Recommendation Systems, of Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman. The book is available at http://infolab.stanford.edu/~ullman/mmds.html.
