Introduction

The following is Wikipedia's definition of recommender systems:

"Recommender systems are a subclass of information filtering system that seeks to predict the rating or preference that a user would give to an item."

Recommender systems are now widely deployed. Amazon uses them to recommend books, Netflix to recommend movies, and Google News to recommend news stories. The following figures show the impact recommendations can have (source: Celma and Lamere, 2008):

  • Two-thirds of the movies watched on Netflix are recommended
  • 38% of the news clicks on Google News are recommended
  • 35% of the sales at Amazon are the result of recommendations

As we saw in the previous chapters, features and feature selection play a major role in the efficacy of machine learning algorithms. Recommender engine algorithms discover these features, called latent features, automatically. In short, latent features are what make a user like one movie and dislike another. If another user has similar latent features, there is a good chance that this person will also have similar taste in movies.
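To make the idea of latent features concrete, here is a minimal sketch that learns two latent features per user and per movie from the sample ratings tabulated just below. It uses plain stochastic gradient descent rather than the alternating least squares approach behind Spark ML's recommender, and every name and hyperparameter in it (`factorize`, `k`, `epochs`, `lr`, `reg`, `seed`) is an illustrative assumption, not part of any library.

```python
import random

users = ["Rich", "Bob", "Peter", "Chris"]
movies = ["Titanic", "GoldenEye", "Toy Story", "Disclosure", "Ace Ventura"]

# Known ratings as (user index, movie index, rating); "?" cells are left out.
known = [
    (0, 0, 5), (1, 0, 3), (2, 0, 5),             # Titanic
    (0, 1, 3), (1, 1, 2), (2, 1, 1), (3, 1, 5),  # GoldenEye
    (0, 2, 1), (2, 2, 2), (3, 2, 2),             # Toy Story
    (0, 3, 4), (1, 3, 4), (3, 3, 4),             # Disclosure
    (0, 4, 4), (2, 4, 4),                        # Ace Ventura
]


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and movie i's latent vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error over the known ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, k=2, epochs=2000,
              lr=0.01, reg=0.02, seed=42):
    """Learn k latent features per user (P) and per movie (Q) with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# epochs=0 returns the random starting point, useful as a baseline.
P0, Q0 = factorize(known, len(users), len(movies), epochs=0)
P, Q = factorize(known, len(users), len(movies))

print("RMSE before training:", rmse(P0, Q0, known))
print("RMSE after training: ", rmse(P, Q, known))

# Estimate one of the missing entries: Chris's rating for Titanic.
chris_titanic = predict(P, Q, users.index("Chris"), movies.index("Titanic"))
print("Predicted rating, Chris on Titanic:", chris_titanic)
```

The learned numbers in `P` and `Q` carry no labels such as "action" or "romance"; they are simply whatever values best reproduce the known ratings, which is why they are called latent.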

To understand this better, let's look at some sample movie ratings:

Movie          Rich   Bob   Peter   Chris
Titanic        5      3     5       ?
GoldenEye      3      2     1       5
Toy Story      1      ?     2       2
Disclosure     4      4     ?       4
Ace Ventura    4      ?     4       ?

Our goal is to predict the missing entries, shown with the ? symbol. Let's see if we can find some features associated with the movies. First, let's look at the genres, as shown here:

Movie          Genre
Titanic        Action and Romance
GoldenEye      Action, Adventure, and Thriller
Toy Story      Animation, Children's, and Comedy
Disclosure     Drama and Thriller
Ace Ventura    Comedy

Now, each movie can be rated for each genre on a scale from 0 to 1. For example, GoldenEye is not primarily a romance, so it may have a rating of 0.1 for romance but a rating of 0.98 for action. Each movie can therefore be represented as a feature vector, with one entry per genre.
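As a rough sketch of what such feature vectors look like in practice, the following snippet scores a few movies on four genres and compares them with cosine similarity. Only GoldenEye's action (0.98) and romance (0.1) scores come from the text; every other number, the choice of four genres, and the helper names `cosine_similarity` and `most_similar` are assumptions made up for illustration.

```python
import math

GENRES = ["action", "romance", "comedy", "thriller"]

# One score per genre, in the order given by GENRES.
# Only GoldenEye's action and romance values are from the text;
# all other values are illustrative guesses.
movie_features = {
    "Titanic":     [0.60, 0.95, 0.10, 0.20],
    "GoldenEye":   [0.98, 0.10, 0.15, 0.85],
    "Disclosure":  [0.20, 0.30, 0.05, 0.90],
    "Ace Ventura": [0.20, 0.10, 0.95, 0.05],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(title):
    """Other movies ranked by similarity to the given one, best match first."""
    target = movie_features[title]
    scored = [
        (other, cosine_similarity(target, features))
        for other, features in movie_features.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("GoldenEye"):
    print(f"GoldenEye vs {other}: {score:.3f}")
```

With these made-up scores, the thriller-heavy Disclosure ranks closest to GoldenEye and the comedy Ace Ventura ranks furthest. Hand-assigned genre scores like these are exactly what latent features replace: the algorithm learns the vectors from ratings instead.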

In this chapter, we are going to use the MovieLens dataset from grouplens.org/datasets/movielens/. Reference: F. Maxwell Harper and Joseph A. Konstan, 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI http://dx.doi.org/10.1145/2827872

The InfoObjects big data sandbox comes loaded with 1 million movie ratings. In this recipe, we are using 20 million ratings, which have been loaded on S3 for your convenience. Because processing this much data requires heavy-duty compute, we recommend using either Databricks Cloud or EMR. Feel free to use the sandbox if you have a machine with a server-level configuration.

We are going to use two files from this dataset:

  • ratings.csv: This has a comma-separated list of movie ratings in the following format:
        user id, movie id, rating, epoch time

Since we do not need the timestamp, we will filter it out of the data in our recipe.

  • movies.csv: This has a comma-separated list of movies in the following format, where a movie with several genres lists them separated by the | character:
        movie id, movie title, genres
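The two file layouts above can be read with a few lines of standard-library Python. This is only a sketch under stated assumptions: it assumes each file starts with a header row, that fields are comma-separated with titles quoted when they contain a comma, and that multiple genres for a movie are joined with `|`. The rating rows are made up for illustration, and `parse_ratings` and `parse_movies` are hypothetical helper names.

```python
import csv
import io

# Illustrative sample rows in the assumed layout (not the real data files).
ratings_csv = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
2,3,4.0,974820889
"""

movies_csv = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
11,"American President, The (1995)",Comedy|Drama|Romance
"""


def parse_ratings(text):
    """Return (user id, movie id, rating) tuples, dropping the timestamp."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row (assumed to be present)
    return [(int(user), int(movie), float(rating))
            for user, movie, rating, _timestamp in reader]


def parse_movies(text):
    """Return {movie id: (title, [genres])}, splitting genres on '|'."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row (assumed to be present)
    return {int(movie): (title, genres.split("|"))
            for movie, title, genres in reader}


ratings = parse_ratings(ratings_csv)
movie_lookup = parse_movies(movies_csv)

for user, movie, rating in ratings:
    print(user, movie, rating)
print(movie_lookup[11])
```

Using the `csv` module rather than a plain `split(",")` matters for movies.csv, because a title such as "American President, The (1995)" contains a comma of its own.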

This chapter will cover how we can make recommendations using Spark ML.
