To demonstrate both the content-based filtering and collaborative filtering approaches, we'll build a book-recommendation engine.
In this chapter, we will work with a book-ratings dataset (Ziegler et al., 2005) collected in a four-week crawl. It contains data on 278,858 members of the Book-Crossing website and 1,157,112 ratings, both implicit and explicit, referring to 271,379 distinct ISBNs. User data is anonymized but includes demographic information. The dataset is available at:
http://www2.informatik.uni-freiburg.de/~cziegler/BX/.
The Book-Crossing dataset comprises three files described at their website as follows:
There are two approaches for loading the data according to where the data is stored: file or database. First, we will take a detailed look at how to load the data from the file, including how to deal with custom formats. At the end, we quickly take a look at how to load the data from a database.
Loading data from a file can be achieved with the FileDataModel class, which expects a comma-delimited file where each line contains a userID, an itemID, an optional preference value, and an optional timestamp, in that order:
userID,itemID[,preference[,timestamp]]
The optional preference value accommodates applications with binary preferences; that is, a user either expresses a preference for an item or not, without any degree of preference (for example, like/dislike).
A line that begins with a hash (#) or an empty line will be ignored. Lines may also contain additional fields, which will be ignored.
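For illustration, a tiny input file in this format might look as follows (the IDs and values here are made up):

```text
# lines starting with a hash are ignored
1,101,3.0,1130000000
1,102,4.5
2,101
```

Note that the second line omits the timestamp and the third line omits both the preference value and the timestamp; both are valid.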
The DataModel class assumes the following types:

- userID and itemID can be parsed as long
- The preference value can be parsed as double
- timestamp can be parsed as long
If you are able to provide the dataset in the preceding format, you can simply use the following line to load the data:
DataModel model = new FileDataModel(new File(path));
This class is not intended for very large amounts of data, for example, tens of millions of rows. For that, a JDBC-backed DataModel and a database are more appropriate.
In the real world, however, we cannot always ensure that the input data supplied to us contains only integer values for userID and itemID. In our case, for example, itemID corresponds to ISBNs, which uniquely identify books but are not integers, so the default FileDataModel won't be suitable for processing our data.
Now, let's consider how to deal with the case where our itemID is a string. We will define a custom data model by extending FileDataModel and overriding the long readItemIDFromString(String) method so that it reads the itemID as a string, converts it into a unique long value, and returns it. To convert a String into a unique long, we'll extend another Mahout helper class, AbstractIDMigrator, which is designed exactly for this task.
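To build some intuition for why such a migration is feasible at all, here is a minimal, self-contained sketch (not Mahout's actual code) of deriving a stable long ID from a string by hashing; Mahout's migrator follows a similar idea and additionally caches the reverse mapping:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class StringToLongId {

    // Derive a stable long ID from a string by taking the first 8 bytes
    // of its MD5 hash. Illustrative sketch only; collisions are
    // theoretically possible but statistically very unlikely.
    static long toLongID(String stringID) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(stringID.getBytes(StandardCharsets.UTF_8));
            long id = 0;
            for (int i = 0; i < 8; i++) {
                id = (id << 8) | (hash[i] & 0xFF);
            }
            return id;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // the same ISBN always maps to the same long
        System.out.println(toLongID("059032120X") == toLongID("059032120X"));
        // distinct ISBNs map to distinct longs (with overwhelming probability)
        System.out.println(toLongID("059032120X") != toLongID("0446605409"));
    }
}
```

Because the forward direction is a one-way hash, the reverse String lookup must be cached explicitly, which is exactly what the migrator class below does.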
Now, let's first look at how FileDataModel is extended:

class StringItemIdFileDataModel extends FileDataModel {

  // initialize migrator to convert String to unique long
  public ItemMemIDMigrator memIdMigtr;

  public StringItemIdFileDataModel(File dataFile, String regex)
      throws IOException {
    super(dataFile, regex);
  }

  @Override
  protected long readItemIDFromString(String value) {
    if (memIdMigtr == null) {
      memIdMigtr = new ItemMemIDMigrator();
    }
    // convert to long
    long retValue = memIdMigtr.toLongID(value);
    // store it in the cache if it is not there yet
    if (null == memIdMigtr.toStringID(retValue)) {
      try {
        memIdMigtr.singleInit(value);
      } catch (TasteException e) {
        e.printStackTrace();
      }
    }
    return retValue;
  }

  // convert long back to String
  String getItemIDAsString(long itemId) {
    return memIdMigtr.toStringID(itemId);
  }
}
Other useful methods that can be overridden are as follows:

- readUserIDFromString(String value), if user IDs are not numeric
- readTimestampFromString(String value), to change how the timestamp is parsed

Now, let's take a look at how AbstractIDMigrator is extended:
class ItemMemIDMigrator extends AbstractIDMigrator {

  private FastByIDMap<String> longToString;

  public ItemMemIDMigrator() {
    this.longToString = new FastByIDMap<String>(10000);
  }

  public void storeMapping(long longID, String stringID) {
    longToString.put(longID, stringID);
  }

  public void singleInit(String stringID) throws TasteException {
    storeMapping(toLongID(stringID), stringID);
  }

  public String toStringID(long longID) {
    return longToString.get(longID);
  }
}
Now, we have everything in place and we can load our dataset with the following code:
StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
System.out.println("Total items: " + model.getNumItems()
    + " Total users: " + model.getNumUsers());
This outputs the total number of users and items:
Total items: 340556 Total users: 105283
We are ready to move on and start making recommendations.
Alternatively, we can load the data from a database using one of the JDBC data models. In this chapter, we will not dive into detailed instructions on how to set up a database, a connection, and so on, but will just sketch how this can be done.
Database connectors have been moved to a separate package, mahout-integration, so we first have to add the package to our dependency list. Open the pom.xml file and add the following dependency:
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-integration</artifactId>
  <version>0.7</version>
</dependency>
Consider that we want to connect to a MySQL database. In this case, we will also need a package that handles database connections. Add the following to the pom.xml file:
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.35</version>
</dependency>
Now, we have all the packages, so we can create a connection. First, let's initialize a DataSource
class with connection details, as follows:
MysqlDataSource dbsource = new MysqlDataSource();
dbsource.setUser("user");
dbsource.setPassword("pass");
dbsource.setServerName("hostname.com");
dbsource.setDatabaseName("db");
Mahout integration implements JDBCDataModel for various databases that can be accessed via JDBC. By default, this class assumes that there is a DataSource available under the JNDI name jdbc/taste, which gives access to a database with a taste_preferences table with the following schema:
CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  preference REAL NOT NULL,
  PRIMARY KEY (user_id, item_id)
);
CREATE INDEX taste_preferences_user_id_index ON taste_preferences (user_id);
CREATE INDEX taste_preferences_item_id_index ON taste_preferences (item_id);
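Ratings are then stored as one row per (user, item) pair; for example (the values here are hypothetical):

```sql
INSERT INTO taste_preferences (user_id, item_id, preference)
VALUES (80683, 101, 7.0);
```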
A database-backed data model is initialized as follows. In addition to the DB connection object, we can also specify the custom table name and table column names, as follows:
DataModel dataModel = new MySQLJDBCDataModel(dbsource,
    "taste_preferences", "user_id", "item_id", "preference", "timestamp");
Last, but not least, a data model can be created on the fly and held in memory. It is built from an array of preferences holding user ratings for a set of items.
We can proceed as follows. First, we create a FastByIDMap hash map of preference arrays; a PreferenceArray stores an array of preferences:
FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
Next, we can create a new preference array for a user that will hold their ratings. The array must be initialized with a size parameter that reserves that many slots in memory:
PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
Next, we set the user ID for the preference at position 0. This will actually set the user ID for all preferences:
prefsForUser1.setUserID(0, 1L);
Set the item ID for the preference at position 0:
prefsForUser1.setItemID(0, 101L);
Set the preference value for the preference at position 0:
prefsForUser1.setValue(0, 3.0f);
Continue for other item ratings:
prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);
Finally, add user preferences to the hash map:
preferences.put(1L, prefsForUser1); // use userID as the key
The preference hash map can now be used to initialize GenericDataModel:
DataModel dataModel = new GenericDataModel(preferences);
This code demonstrates how to add two preferences for a single user; in a practical application, you'll want to add multiple preferences for multiple users.
Recommendation engines in Mahout can be built with the org.apache.mahout.cf.taste package, which was formerly a separate project called Taste and has continued development within Mahout.
A Mahout-based collaborative filtering engine takes the users' preferences for items (tastes) and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Mahout to figure out, from previous purchase data, the CDs that a customer might be interested in listening to.
Top-level packages define the Mahout interfaces to the following key abstractions:

- DataModel: represents a repository of information about users and their preferences for items
- UserSimilarity and ItemSimilarity: define a notion of similarity between two users or two items
- UserNeighborhood: computes a neighborhood of users similar to a given user
- Recommender: produces recommended items for a user
A general structure of the concepts is shown in the following diagram:
The most basic user-based collaborative filtering can be implemented by initializing the previously described components as follows.
First, load the data model:
StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
Next, define how the users are correlated, for example, using Pearson correlation:
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
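For intuition, the Pearson correlation that this similarity computes over two users' co-rated items can be sketched in plain Java. This is illustrative only; Mahout's implementation additionally handles sparse data, weighting, and edge cases:

```java
public class Pearson {

    // Pearson correlation of two equally indexed rating vectors
    // (the ratings over the items both users rated).
    static double pearson(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            dx += (x[i] - meanX) * (x[i] - meanX);
            dy += (y[i] - meanY) * (y[i] - meanY);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        // perfectly linearly related ratings give correlation 1.0
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{4, 6, 8}));
    }
}
```

Two users who rate items on different absolute scales but in the same relative order still come out highly correlated, which is exactly why Pearson correlation is a popular choice here.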
Next, define how to tell which users are similar, that is, users that are close to each other according to their ratings:
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
Now, we can initialize a GenericUserBasedRecommender, the default engine, with the data model, neighborhood, and similarity objects, as follows:
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
That's it. Our first basic recommendation engine is ready. Let's discuss how to invoke recommendations. First, let's print the items that the user already rated along with ten recommendations for this user:
long userID = 80683;
int noItems = 10;

List<RecommendedItem> recommendations = recommender.recommend(userID, noItems);

System.out.println("Rated items by user:");
for (Preference preference : model.getPreferencesFromUser(userID)) {
  // convert the long itemID back to an ISBN
  String itemISBN = model.getItemIDAsString(preference.getItemID());
  System.out.println("Item: " + books.get(itemISBN)
      + " | Item id: " + itemISBN
      + " | Value: " + preference.getValue());
}

System.out.println(" Recommended items:");
for (RecommendedItem item : recommendations) {
  String itemISBN = model.getItemIDAsString(item.getItemID());
  System.out.println("Item: " + books.get(itemISBN)
      + " | Item id: " + itemISBN
      + " | Value: " + item.getValue());
}
This outputs the following recommendations along with their scores:
Rated items:
Item: The Handmaid's Tale | Item id: 0395404258 | Value: 0.0
Item: Get Clark Smart : The Ultimate Guide for the Savvy Consumer | Item id: 1563526298 | Value: 9.0
Item: Plum Island | Item id: 0446605409 | Value: 0.0
Item: Blessings | Item id: 0440206529 | Value: 0.0
Item: Edgar Cayce on the Akashic Records: The Book of Life | Item id: 0876044011 | Value: 0.0
Item: Winter Moon | Item id: 0345386108 | Value: 6.0
Item: Sarah Bishop | Item id: 059032120X | Value: 0.0
Item: Case of Lucy Bending | Item id: 0425060772 | Value: 0.0
Item: A Desert of Pure Feeling (Vintage Contemporaries) | Item id: 0679752714 | Value: 0.0
Item: White Abacus | Item id: 0380796155 | Value: 5.0
Item: The Land of Laughs : A Novel | Item id: 0312873115 | Value: 0.0
Item: Nobody's Son | Item id: 0152022597 | Value: 0.0
Item: Mirror Image | Item id: 0446353957 | Value: 0.0
Item: All I Really Need to Know | Item id: 080410526X | Value: 0.0
Item: Dreamcatcher | Item id: 0743211383 | Value: 7.0
Item: Perplexing Lateral Thinking Puzzles: Scholastic Edition | Item id: 0806917695 | Value: 5.0
Item: Obsidian Butterfly | Item id: 0441007813 | Value: 0.0

Recommended items:
Item: Keeper of the Heart | Item id: 0380774933 | Value: 10.0
Item: Bleachers | Item id: 0385511612 | Value: 10.0
Item: Salem's Lot | Item id: 0451125452 | Value: 10.0
Item: The Girl Who Loved Tom Gordon | Item id: 0671042858 | Value: 10.0
Item: Mind Prey | Item id: 0425152898 | Value: 10.0
Item: It Came From The Far Side | Item id: 0836220730 | Value: 10.0
Item: Faith of the Fallen (Sword of Truth, Book 6) | Item id: 081257639X | Value: 10.0
Item: The Talisman | Item id: 0345444884 | Value: 9.86375
Item: Hamlet | Item id: 067172262X | Value: 9.708363
Item: Untamed | Item id: 0380769530 | Value: 9.708363
The ItemSimilarity is the most important component to discuss here. Item-based recommenders are useful because they can exploit something fast: they base their computations on item similarity, not user similarity, and item similarity is relatively static. It can be precomputed, instead of recomputed in real time.
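The precomputation idea can be sketched as follows; the similarity measure here is a toy stand-in for a real one such as Pearson correlation, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class SimilarityCache {

    // Stand-in for a real similarity measure (e.g. Pearson correlation
    // over the items' rating vectors). Toy formula for illustration only.
    static double similarity(long a, long b) {
        return a == b ? 1.0 : 1.0 / (1 + Math.abs(a - b));
    }

    public static void main(String[] args) {
        long[] items = {101, 102, 103};

        // Precompute all pairwise similarities once, offline.
        Map<String, Double> cache = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                cache.put(items[i] + ":" + items[j],
                          similarity(items[i], items[j]));
            }
        }

        // Serving a recommendation now only needs a constant-time lookup.
        System.out.println(cache.get("101:102"));
    }
}
```

Mahout's GenericItemSimilarity plays the role of this cache: it is loaded with precomputed item-item similarities and answers lookups without touching the raw ratings.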
Thus, it's strongly recommended that you use GenericItemSimilarity with precomputed similarities if you're going to use this class. You can use PearsonCorrelationSimilarity too, which computes similarities in real time, but you will probably find this painfully slow for large amounts of data:
StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");

ItemSimilarity itemSimilarity = new PearsonCorrelationSimilarity(model);

ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(model, itemSimilarity);

String itemISBN = "0395272238";
long itemID = model.readItemIDFromString(itemISBN);
int noItems = 10;

List<RecommendedItem> recommendations =
    recommender.mostSimilarItems(itemID, noItems);

System.out.println("Recommendations for item: " + books.get(itemISBN));
System.out.println(" Most similar items:");
for (RecommendedItem item : recommendations) {
  itemISBN = model.getItemIDAsString(item.getItemID());
  System.out.println("Item: " + books.get(itemISBN)
      + " | Item id: " + itemISBN
      + " | Value: " + item.getValue());
}
Recommendations for item: Close to the Bone
Most similar items:
Item: Private Screening | Item id: 0345311396 | Value: 1.0
Item: Heartstone | Item id: 0553569783 | Value: 1.0
Item: Clockers / Movie Tie In | Item id: 0380720817 | Value: 1.0
Item: Rules of Prey | Item id: 0425121631 | Value: 1.0
Item: The Next President | Item id: 0553576666 | Value: 1.0
Item: Orchid Beach (Holly Barker Novels (Paperback)) | Item id: 0061013412 | Value: 1.0
Item: Winter Prey | Item id: 0425141233 | Value: 1.0
Item: Night Prey | Item id: 0425146413 | Value: 1.0
Item: Presumed Innocent | Item id: 0446359866 | Value: 1.0
Item: Dirty Work (Stone Barrington Novels (Paperback)) | Item id: 0451210158 | Value: 1.0
The resulting list is a set of items similar to the particular item that we selected.
It often happens that business rules require us to boost the score of selected items. In our book dataset, for example, if a book is recent, we want to give it a higher score. That's possible by implementing the IDRescorer interface, which declares the following methods:

- rescore(long, double), which takes the itemId and the original score as arguments and returns a modified score
- isFiltered(long), which may return true to exclude a specific item from the recommendations, or false otherwise

Our example could be implemented as follows:
class MyRescorer implements IDRescorer {

  public double rescore(long itemId, double originalScore) {
    double newScore = originalScore;
    if (bookIsNew(itemId)) {
      newScore = originalScore * 1.3;
    }
    return newScore;
  }

  public boolean isFiltered(long arg0) {
    return false;
  }
}
An instance of IDRescorer is provided when invoking recommender.recommend:
IDRescorer rescorer = new MyRescorer();
List<RecommendedItem> recommendations =
    recommender.recommend(userID, noItems, rescorer);
You might wonder how to make sure that the returned recommendations make sense. The only way to be really sure how effective the recommendations are is to use A/B testing in a live system with real users. For example, group A receives a random item as a recommendation, while group B receives an item recommended by our engine.
As this is not always possible or practical, we can get an estimate with an offline statistical evaluation. One way to proceed is to use the k-fold cross-validation introduced in Chapter 1, Applied Machine Learning Quick Start. We partition the dataset into multiple sets; some are used to train our recommendation engine, and the rest are used to test how well it recommends items to unknown users.
Mahout implements the RecommenderEvaluator class, which splits a dataset in two parts. The first part (90% by default) is used to produce recommendations, while the rest of the data is compared against the estimated preference values to test the match. The class does not accept a recommender object directly; instead, you need to build a class implementing the RecommenderBuilder interface, which builds a recommender object for a given DataModel object that is then used for testing. Let's take a look at how this is implemented.
First, we create a class that implements the RecommenderBuilder interface. We need to implement the buildRecommender method, which will return a recommender, as follows:
public class BookRecommender implements RecommenderBuilder {

  public Recommender buildRecommender(DataModel dataModel)
      throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood neighborhood =
        new ThresholdUserNeighborhood(0.1, similarity, dataModel);
    UserBasedRecommender recommender =
        new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
    return recommender;
  }
}
Now that we have a class that returns a recommender object, we can initialize a RecommenderEvaluator instance. The default implementation of this class is AverageAbsoluteDifferenceRecommenderEvaluator, which computes the average absolute difference between the predicted and actual ratings for users. The following code shows how to put the pieces together and run a hold-out test.
First, load a data model:
DataModel dataModel = new FileDataModel(new File("/path/to/dataset.csv"));
Next, initialize an evaluator instance, as follows:
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
Initialize the BookRecommender object, which implements the RecommenderBuilder interface:
RecommenderBuilder builder = new BookRecommender();
Finally, call the evaluate() method, which accepts the following parameters:

- RecommenderBuilder: the object implementing RecommenderBuilder that can build the recommender to test
- DataModelBuilder: the DataModelBuilder to use; if null, a default DataModel implementation will be used
- DataModel: the dataset that will be used for testing
- trainingPercentage: the percentage of each user's preferences to use to produce recommendations; the rest are compared to the estimated preference values to evaluate the recommender's performance
- evaluationPercentage: the percentage of users to be used in the evaluation

The method is called as follows:
double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
System.out.println(result);
The method returns a double, where 0 represents the best possible evaluation, meaning that the recommender perfectly matches user preferences. In general, the lower the value, the better the match.
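The score itself is easy to compute by hand. A minimal sketch of the average-absolute-difference idea (not the Mahout class itself; the rating values below are made up):

```java
public class AvgAbsDiff {

    // Mean absolute difference between estimated and actual preference
    // values: the quantity the evaluator reports, where 0.0 is a perfect match.
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double sum = 0;
        for (int i = 0; i < estimated.length; i++) {
            sum += Math.abs(estimated[i] - actual[i]);
        }
        return sum / estimated.length;
    }

    public static void main(String[] args) {
        double[] estimated = {7.5, 9.0, 4.0};
        double[] actual = {8.0, 9.0, 5.0};
        // (0.5 + 0.0 + 1.0) / 3 = 0.5
        System.out.println(averageAbsoluteDifference(estimated, actual));
    }
}
```

On a 0-10 rating scale, a score of 0.5 would mean the recommender's estimates are off by half a rating point on average.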
What about the online aspect? The preceding approach works great for existing users, but what about new users who register with the service? We certainly want to provide reasonable recommendations for them as well. Creating a recommender instance is expensive (it definitely takes longer than a typical network request), so we can't just create a new one for each request.
Luckily, Mahout offers the possibility of adding temporary users to a data model. The general setup is as follows:

- Periodically recreate the whole recommender using the current data
- When asked for a recommendation, first check whether the user already exists in the data model
- If yes, proceed as always
- If not, create a temporary user, fill in their preferences, and ask for the recommendation
The first part (periodically recreating the recommender) may actually be quite tricky if you are limited on memory: while creating the new recommender, you need to hold two copies of the data in memory (in order to still be able to serve requests from the old one). However, as this doesn't really have anything to do with recommendations, I won't go into details here.
As for the temporary users, we can wrap our data model in a PlusAnonymousConcurrentUserDataModel instance. This class allows us to obtain a temporary user ID; the ID must later be released so that it can be reused (there's a limited number of such IDs). After obtaining the ID, we fill in the preferences, and then we can proceed with the recommendation as always:
class OnlineRecommendation {

  Recommender recommender;
  int concurrentUsers = 100;
  int noItems = 10;

  public OnlineRecommendation() throws IOException {
    DataModel model = new StringItemIdFileDataModel(
        new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
    PlusAnonymousConcurrentUserDataModel plusModel =
        new PlusAnonymousConcurrentUserDataModel(model, concurrentUsers);
    recommender = ...;
  }

  public List<RecommendedItem> recommend(long userId,
      PreferenceArray preferences) throws TasteException {
    if (userExistsInDataModel(userId)) {
      return recommender.recommend(userId, noItems);
    } else {
      PlusAnonymousConcurrentUserDataModel plusModel =
          (PlusAnonymousConcurrentUserDataModel) recommender.getDataModel();

      // Take an available anonymous user from the pool
      Long anonymousUserID = plusModel.takeAvailableUser();

      // Set the temporary preferences (the supplied array already
      // contains the item IDs and values)
      PreferenceArray tempPrefs = preferences;
      tempPrefs.setUserID(0, anonymousUserID);
      plusModel.setTempPrefs(tempPrefs, anonymousUserID);

      List<RecommendedItem> results =
          recommender.recommend(anonymousUserID, noItems);

      // Release the user back to the pool
      plusModel.releaseUser(anonymousUserID);
      return results;
    }
  }
}