The following is Wikipedia's definition of recommender systems:
Recommender systems have gained immense popularity in recent years. Amazon uses them to recommend books, Netflix for movies, and Google News to recommend news stories. As the proof is in the pudding, here are some examples of the impact recommendations can have (source: Celma, Lamere, 2008):
- Two-thirds of the movies watched on Netflix are recommended
- 38 % of the news clicks on Google News are recommended
- 35 % of the sales at Amazon are the result of recommendations
As we saw in the previous chapters, features and feature selection play a major role in the efficacy of machine learning algorithms. Recommender engine algorithms discover these features, called latent features, automatically. In short, latent features are what make a user like one movie and dislike another. If another user has similar latent features, there is a good chance that this person will have a similar taste in movies.
To understand this better, let's look at some sample movie ratings:
| Movie | Rich | Bob | Peter | Chris |
|---|---|---|---|---|
| Titanic | 5 | 3 | 5 | ? |
| GoldenEye | 3 | 2 | 1 | 5 |
| Toy Story | 1 | ? | 2 | 2 |
| Disclosure | 4 | 4 | ? | 4 |
| Ace Ventura | 4 | ? | 4 | ? |
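As a minimal sketch, the sample ratings above can be held in a plain Python structure, with `None` standing in for each `?` entry. The variable names are illustrative only; the values are copied from the table.

```python
# Column order of the sample ratings table.
users = ["Rich", "Bob", "Peter", "Chris"]

# One row per movie; None marks a rating we do not know yet ("?" in the table).
ratings = {
    "Titanic":     [5, 3, 5, None],
    "GoldenEye":   [3, 2, 1, 5],
    "Toy Story":   [1, None, 2, 2],
    "Disclosure":  [4, 4, None, 4],
    "Ace Ventura": [4, None, 4, None],
}

# Collect every (movie, user) pair whose rating is missing.
missing = [
    (movie, user)
    for movie, row in ratings.items()
    for user, value in zip(users, row)
    if value is None
]

for movie, user in missing:
    print(f"{user} has not rated {movie}")
```

These missing pairs are exactly the entries a recommender would try to fill in.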
Our goal is to predict the missing entries shown with the ? symbol. Let's see if we can find some features associated with the movies. First, let's look at the genres, as shown here:
| Movie | Genre |
|---|---|
| Titanic | Action and Romance |
| GoldenEye | Action, Adventure, and Thriller |
| Toy Story | Animation, Children's, and Comedy |
| Disclosure | Drama and Thriller |
| Ace Ventura | Comedy |
Now, each movie can be rated for each genre on a scale from 0 to 1. For example, GoldenEye is not primarily a romance, so it may have a 0.1 rating for romance but a 0.98 rating for action. Each movie can therefore be represented as a feature vector.
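To make the feature-vector idea concrete, here is a small Python sketch. Only the GoldenEye scores (0.98 for action, 0.1 for romance) come from the text; the Titanic scores, the user preference vector, and the `affinity` helper are assumptions made up for illustration. Scoring a movie with a dot product is one simple way to compare vectors, not necessarily what a given recommender does internally.

```python
# Genre order used for every vector in this sketch (an assumption).
GENRES = ["action", "romance"]

movie_features = {
    "GoldenEye": [0.98, 0.1],   # scores taken from the text
    "Titanic":   [0.3, 0.95],   # hypothetical scores
}

# Hypothetical user who likes action much more than romance.
user_preferences = [0.9, 0.2]

def affinity(user, movie):
    """Dot product of a user's preference vector and a movie's feature vector."""
    return sum(u * m for u, m in zip(user, movie))

scores = {
    title: affinity(user_preferences, vector)
    for title, vector in movie_features.items()
}

# The movie with the highest score is the best match for this user.
best = max(scores, key=scores.get)
print(best, scores)
```

With these made-up numbers, the action-heavy movie scores higher for the action-loving user, which is the intuition behind matching users and movies through shared features.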
In this chapter, we are going to use the MovieLens dataset from grouplens.org/datasets/movielens/ (F. Maxwell Harper and Joseph A. Konstan, 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI http://dx.doi.org/10.1145/2827872).
The InfoObjects big data sandbox comes loaded with 1 million movie ratings. In this recipe, we are using 20 million ratings, which have been loaded on S3 for your convenience. Since this requires heavy-duty compute, we recommend using either Databricks Cloud or EMR. Feel free to use the sandbox if you have a machine with a server-level configuration.
We are going to use two files from this dataset:
- ratings.csv: This has a comma-separated list of movie ratings in the following format:
user id, movie id, rating, epoch time
Since we do not need the timestamp, we will filter it out of the data in our recipe.
- movies.csv: This has a comma-separated list of movies in the following format:
movie id, movie title, genre
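The ratings format described above, including the step of dropping the timestamp, can be sketched in plain Python as follows. This is only an illustration of the record layout, not the Spark code used in the recipe; the `Rating` type, the `parse_rating` helper, and the sample rows are assumptions. A real file may also begin with a header row, which this sketch does not handle.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    """A single rating with the timestamp already dropped."""
    user_id: int
    movie_id: int
    rating: float

def parse_rating(line: str) -> Rating:
    """Parse 'user id,movie id,rating,epoch time' and discard the timestamp."""
    user_id, movie_id, rating, _timestamp = line.strip().split(",")
    return Rating(int(user_id), int(movie_id), float(rating))

# Sample rows for illustration only.
sample_lines = [
    "1,2,3.5,1112486027",
    "1,29,3.5,1112484676",
]

ratings = [parse_rating(line) for line in sample_lines]
for r in ratings:
    print(r)
```

Keeping only the user, movie, and rating fields gives the triple that rating-based recommenders typically train on.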
This chapter will cover how we can make recommendations using Spark ML.