Introduction

The following is Wikipedia's definition of recommender systems:

"Recommender systems are a subclass of information filtering system that seeks to predict the rating or preference that a user would give to an item."

Recommender systems have gained immense popularity in recent years. Amazon uses them to recommend books, Netflix to recommend movies, and Google News to recommend news stories. As the proof is in the pudding, here are some examples of the impact recommendations can have (source: Celma, Lamere, 2008):

Two-thirds of the movies watched on Netflix are recommended
38 % of the news clicks on Google News are recommended
35 % of the sales at Amazon are the result of recommendations

As we saw in the previous chapters, features and feature selection play a major role in the efficacy of machine learning algorithms. Recommender engine algorithms discover these features, called latent features, automatically. In short, latent features are what lead a user to like one movie and dislike another. If another user has similar latent features, there is a good chance that this person will also have a similar taste in movies.

To understand this better, let's look at some sample movie ratings:

Movie	Rich	Bob	Peter	Chris
Titanic	5	3	5	?
GoldenEye	3	2	1	5
Toy Story	1	?	2	2
Disclosure	4	4	?	4
Ace Ventura	4	?	4	?

Our goal is to predict the missing entries, shown with the ? symbol. Let's see if we can find some features associated with the movies. As a first step, we can look at the genres, as shown here:

Movie	Genre
Titanic	Action and Romance
GoldenEye	Action, Adventure, and Thriller
Toy Story	Animation, Children's, and Comedy
Disclosure	Drama and Thriller
Ace Ventura	Comedy

Now, each movie can be rated for each genre on a scale from 0 to 1. For example, GoldenEye is not primarily a romance, so it may have a rating of 0.1 for romance, but a rating of 0.98 for action. Each movie can therefore be represented as a feature vector, with one entry per genre.
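To make the idea of a feature vector concrete, here is a minimal sketch in plain Python. Only GoldenEye's action (0.98) and romance (0.1) scores come from the text; the genre order, every other score, and the helper names are assumptions made up for illustration. Cosine similarity is used only as one common way to compare such vectors, not as the method this chapter prescribes.

```python
import math

# Genre order shared by every vector (an assumption for this sketch).
GENRES = ["action", "romance", "comedy", "thriller"]

# Only GoldenEye's action (0.98) and romance (0.10) scores come from the text;
# all other numbers are invented for illustration.
movie_features = {
    "GoldenEye": [0.98, 0.10, 0.05, 0.80],
    "Titanic": [0.40, 0.95, 0.05, 0.10],
    "Ace Ventura": [0.10, 0.05, 0.95, 0.05],
}


def genre_score(movie, genre):
    """Look up one genre score in a movie's feature vector."""
    return movie_features[movie][GENRES.index(genre)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    golden_eye = movie_features["GoldenEye"]
    for title, vector in movie_features.items():
        print(title, round(cosine_similarity(golden_eye, vector), 3))
```

With these made-up scores, GoldenEye comes out closer to Titanic than to Ace Ventura, because the first two share some weight on action while Ace Ventura is almost entirely comedy.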

In this chapter, we are going to use the MovieLens dataset from grouplens.org/datasets/movielens/. For background on the dataset, see F. Maxwell Harper and Joseph A. Konstan, 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI http://dx.doi.org/10.1145/2827872.

The InfoObjects big data sandbox comes loaded with 1 million movie ratings. In this recipe, we are using 20 million ratings, which have been loaded on S3 for your convenience. Because processing this much data requires heavy-duty compute, we recommend using either Databricks Cloud or EMR. Feel free to use the sandbox if you have a machine with a server-level configuration.

We are going to use two files from this dataset:

ratings.csv: This has a comma-separated list of movie ratings in the following format:

        user id , movie id , rating , epoch time

Since we do not need the timestamp, we are going to filter it out of the data in our recipe.

movies.csv: This has a comma-separated list of movies in the following format:

        movie id , movie title , genre
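The two record layouts above can be parsed with a few lines of plain Python before any Spark code is involved. This is only a sketch: the sample records, the movie IDs, and the names `Rating`, `parse_rating`, and `parse_movies` are hypothetical, and it assumes both files are comma-separated as the text describes. The delimiter is left as a parameter in case your copy of movies.csv uses a different separator.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float


def parse_rating(line):
    """Parse 'user id,movie id,rating,epoch time' and drop the timestamp."""
    user_id, movie_id, rating, _timestamp = line.strip().split(",")
    return Rating(int(user_id), int(movie_id), float(rating))


def parse_movies(text, delimiter=","):
    """Return {movie_id: (title, genre)} from delimited movie records."""
    movies = {}
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        skipinitialspace=True)
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        movie_id, title, genre = (field.strip() for field in row)
        if not movie_id.isdigit():
            continue  # skips a header row, if the file has one
        movies[int(movie_id)] = (title, genre)
    return movies


# Hypothetical sample records, made up for this sketch.
sample_ratings = ["1,10,3.5,1112486027", "2,11,4.0,1112484676"]
sample_movies = '10,Ace Ventura,Comedy\n11,"Hypothetical Title, The",Drama\n'

if __name__ == "__main__":
    titles = parse_movies(sample_movies)
    for r in (parse_rating(line) for line in sample_ratings):
        print(r.user_id, titles[r.movie_id][0], r.rating)
```

The `csv` module is used for the movie records rather than a plain `split`, since a title that itself contains the delimiter would otherwise be cut in two when it is quoted in the file.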

This chapter will cover how we can make recommendations using Spark ML.
