Collaborative filtering is a technique for discovering relationships between people and items (for example, books and music). It works by examining the preferences of a set of users, such as the items they purchase, and then determining which users have similar preferences. Collaborative filtering can be used to build recommender systems, which are used by many companies, including Amazon, LinkedIn, and Facebook.
In this recipe, we are going to use Apache Mahout to generate book recommendations based on a dataset containing people's book preferences.
You will need to download, compile, and install the following:
Once you have compiled Mahout, add the mahout binary to the system path. In addition, you must set the HADOOP_HOME environment variable to point to the root folder of your Hadoop installation. You can accomplish this in the bash shell by using the following commands:
$ export PATH=$PATH:/path/to/mahout/bin
$ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
Next, extract the Book-Crossing Dataset into your current working directory. You should see three files named BX-Books.csv, BX-Book-Ratings.csv, and BX-Users.csv.
Carry out the following steps to perform collaborative filtering in Mahout:
Use the clean_book_ratings.py script to transform the BX-Book-Ratings.csv file into a format the Mahout recommender can use:
$ ./clean_book_ratings.py BX-Book-Ratings.csv cleaned_book_ratings.txt
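The clean_book_ratings.py script is provided with this recipe's code bundle. As a rough illustration only, a minimal sketch of the kind of transformation it performs might look like the following. The function name, the assumption of a quoted, semicolon-delimited layout, and the policy of dropping non-numeric ISBNs are all assumptions here, not the script's confirmed behavior:

```python
import csv

def clean_ratings(in_path, out_path):
    # Convert quoted, semicolon-delimited BX-Book-Ratings rows into
    # Mahout's comma-separated USER_ID,ITEM_ID,RATING format.
    with open(in_path) as src, open(out_path, "w") as dst:
        reader = csv.reader(src, delimiter=";")
        next(reader)  # skip the header row
        for user_id, isbn, rating in reader:
            # Mahout requires integer item IDs, so rows whose ISBN
            # contains non-digit characters (e.g. a trailing 'X')
            # are dropped in this sketch.
            if user_id.isdigit() and isbn.isdigit():
                dst.write("%d,%d,%d\n" % (int(user_id), int(isbn), int(rating)))
```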
Use the clean_book_users.sh bash script to transform the BX-Users.csv file into a format the Mahout recommender can use. Note that the BX-Users.csv file should be in your current working directory:
$ ./clean_book_users.sh
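The actual cleaning step is a bash script, but its job is simple: keep only the user IDs, since the --usersFile input expects one integer ID per line. A hypothetical Python equivalent (assuming the same quoted, semicolon-delimited layout as the other BX files) could look like this:

```python
import csv

def clean_users(in_path, out_path):
    # Keep only the USER_ID column; the recommender's user list
    # needs one integer ID per line.
    with open(in_path) as src, open(out_path, "w") as dst:
        reader = csv.reader(src, delimiter=";")
        next(reader)  # skip the header row
        for row in reader:
            if row and row[0].isdigit():
                dst.write(row[0] + "\n")
```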
Put the cleaned_book_ratings.txt and cleaned_book_users.txt files into HDFS:
$ hadoop fs -mkdir /user/hadoop/books
$ hadoop fs -put cleaned_book_ratings.txt /user/hadoop/books
$ hadoop fs -put cleaned_book_users.txt /user/hadoop/books
Next, launch the Mahout recommendation engine:
$ mahout recommenditembased --input /user/hadoop/books/cleaned_book_ratings.txt --output /user/hadoop/books/recommended --usersFile /user/hadoop/books/cleaned_book_users.txt -s SIMILARITY_LOGLIKELIHOOD
Once the jobs complete, the results are stored in HDFS in the format USERID [RECOMMENDED BOOK ISBN:SCORE,...]. The output should look similar to the following:
$ hadoop fs -cat /user/hadoop/books/recommended/part* | head -n1
17 [849911788:4.497727,807503193:4.497536,881030392:4.497536,761528547:4.497536,380724723:4.497536,807533424:4.497536,310203414:4.497536,590344153:4.497536,761536744:4.497536,531000265:4.497536]
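Each output line can be split back into a user ID and a list of (ISBN, score) pairs. The following parsing sketch assumes only what the sample above shows: whitespace between the user ID and the bracketed list, and comma-separated ISBN:SCORE pairs inside the brackets:

```python
def parse_recommendation_line(line):
    # Split 'USERID [ISBN:SCORE,ISBN:SCORE,...]' into a user ID
    # and a list of (ISBN, score) tuples.
    user_part, rec_part = line.strip().split(None, 1)
    recommendations = []
    for pair in rec_part.strip("[]").split(","):
        isbn, score = pair.split(":")
        recommendations.append((isbn, float(score)))
    return int(user_part), recommendations
```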
To view the results in a more readable format, use the print_user_summaries.py script. To print the recommendations for the first 10 users, use 10 for the last argument to print_user_summaries.py:
$ hadoop fs -cat /user/hadoop/books/recommended/part-r-00000 | ./print_user_summaries.py BX-Books.csv BX-Users.csv BX-Book-Ratings.csv 10
==========
user id = 114073
rated:
Digital Fortress : A Thriller with: 9
Angels & Demons with: 10
recommended:
Morality for Beautiful Girls (No.1 Ladies Detective Agency)
Q Is for Quarry
The Last Juror
The Da Vinci Code
Deception Point
A Walk in the Woods: Rediscovering America on the Appalachian Trail (Official Guides to the Appalachian Trail)
Tears of the Giraffe (No.1 Ladies Detective Agency)
The No. 1 Ladies' Detective Agency (Today Show Book Club #8)
The output from print_user_summaries.py shows which books the user rated, and then it shows the recommendations generated by Mahout.
The first steps of this recipe required us to clean up the Book-Crossing dataset. The BX-Book-Ratings.csv file was in a semicolon-delimited format with the following columns:
- USER_ID: The user ID assigned to a person
- ISBN: The ISBN of the book the person reviewed
- BOOK-RATING: The rating the person gave to the book
The Mahout recommendation engine expects the input dataset to be in the following comma-separated format:
- USER_ID: The USER_ID must be an integer
- ITEM_ID: The ITEM_ID must be an integer
- RATING: The RATING must be an integer that increases in order of preference. For example, 1 would mean that the user disliked the book intensely, and 10 would mean that the user enjoyed it.
Once the transformation of the BX-Book-Ratings.csv file was complete, we performed a similar transformation on the BX-Users.csv file, stripping away all of its information except for the USER_ID.
Finally, we launched the Mahout recommendation engine. Mahout launches a series of MapReduce jobs to determine the book recommendations for a given set of users, specified with the --usersFile flag. In this example, we wanted Mahout to generate book recommendations for all of the users in the dataset, so we provided the complete USER_ID list to Mahout. In addition to providing an input path, output path, and user list as command-line arguments, we also specified a fourth parameter: -s SIMILARITY_LOGLIKELIHOOD. The -s flag specifies which similarity measure Mahout should use to compare book preferences across all users. This recipe used log likelihood because it is a simple and effective algorithm, but Mahout supports many more similarity functions. To see for yourself, run the following command and examine the options for the -s flag:
$ mahout recommenditembased
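SIMILARITY_LOGLIKELIHOOD is based on Dunning's log-likelihood ratio test over a 2x2 co-occurrence table. Mahout's implementation is in Java; the Python sketch below illustrates only the underlying statistic, not Mahout's API:

```python
import math

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    total = float(sum(counts))
    return sum(-c * math.log(c / total) for c in counts if c > 0)

def log_likelihood_ratio(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio over a 2x2 co-occurrence table:
    # k11 = users who rated both items, k12/k21 = users who rated
    # only one of the two, k22 = users who rated neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)
```

A score near zero means two items co-occur no more often than chance would predict; larger scores indicate a stronger association between the items.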