The Apriori implementation

The goal of this chapter is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions where a person who recommends a set of movies is likely to recommend another particular movie.

To do this, we first need to determine whether a person recommends a movie. We can do this by creating a new feature, Favorable, which is True if the person gave the movie a favorable review (a rating above 3):

all_ratings["Favorable"] = all_ratings["Rating"] > 3

We can see the new feature by viewing the dataset:

all_ratings[10:15]

    UserID  MovieID  Rating             Datetime  Favorable
10      62      257       2  1997-11-12 22:07:14      False
11     286     1014       5  1997-11-17 15:38:45       True
12     200      222       5  1997-10-05 09:05:40       True
13     210       40       3  1998-03-27 21:59:54      False
14     224       29       3  1998-02-21 23:40:57      False

We will sample our dataset to form a training dataset. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users:

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]
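
A quick check shows how many users and reviews made it into the sample. Note that MovieLens user IDs start at 1, so range(200) in fact keeps users 1 through 199:

print(ratings["UserID"].nunique())  # distinct users in the sample
print(len(ratings))                 # total reviews in the sample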

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings = ratings[ratings["Favorable"]]

We will be searching the users' favorable reviews for our itemsets. So, the next thing we need is the movies that each user has given a favorable review. We can compute this by grouping the dataset by the UserID and iterating over the movies in each group:

favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a user has rated a movie. Sets are much faster than lists for this type of lookup, and we will use them in later code.
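
As a quick illustration with made-up movie IDs, a frozenset gives us constant-time membership tests and, unlike a normal set, can be used as a dictionary key:

user_movies = frozenset([7, 50, 100, 258])  # made-up movie IDs for this example
print(50 in user_movies)   # True: an average O(1) hash lookup
print(42 in user_movies)   # False
counts = {user_movies: 1}  # frozensets are hashable, so they work as dict keys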

Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

We can see the top five movies by running the following code:

num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]

This gives us the following list:

MovieID  Favorable
     50        100
    100         89
    258         83
    181         79
    174         74

The Apriori algorithm

The Apriori algorithm is part of our affinity analysis and deals specifically with finding frequent itemsets within the data. The basic procedure of Apriori builds up new candidate itemsets from previously discovered frequent itemsets. These candidates are tested to see if they are frequent, and then the algorithm iterates as explained here (a short sketch in code follows the list):

  1. Create initial frequent itemsets by placing each item in its own itemset. Only items with at least the minimum support are used in this step.
  2. New candidate itemsets are created by taking the most recently discovered frequent itemsets and adding one new item to each, forming supersets that are one item larger.
  3. All candidate itemsets are tested to see if they are frequent. If a candidate is not frequent then it is discarded. If there are no new frequent itemsets from this step, go to the last step.
  4. Store the newly discovered frequent itemsets and go to the second step.
  5. Return all of the discovered frequent itemsets.
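
To make the procedure concrete, here is a minimal, self-contained sketch of these five steps on a tiny dataset. The transactions and the threshold are invented purely for illustration:

from collections import defaultdict

# Made-up transactions, purely for illustration
transactions = [frozenset(t) for t in
                ({"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                 {"B", "C"}, {"A", "B", "C"})]
min_support = 3

# Step 1: place each item in its own itemset, keeping only the frequent ones
counts = defaultdict(int)
for transaction in transactions:
    for item in transaction:
        counts[frozenset((item,))] += 1
frequent = {itemset for itemset, count in counts.items() if count >= min_support}

while frequent:
    print(sorted(tuple(sorted(itemset)) for itemset in frequent))
    # Step 2: candidate supersets, one item larger than the current itemsets
    all_items = frozenset.union(*frequent)
    candidates = {itemset | frozenset((item,))
                  for itemset in frequent for item in all_items - itemset}
    # Step 3: keep only the candidates with enough support
    frequent = {candidate for candidate in candidates
                if sum(candidate.issubset(t) for t in transactions) >= min_support}
    # Steps 4 and 5: the loop repeats with the new itemsets and stops once
    # nothing new is frequent, having printed every discovered level

Running this prints the frequent itemsets of length one, then length two; the only length-three candidate does not reach the threshold, so the loop ends.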

This process is outlined in the following workflow:

[Figure: The Apriori algorithm workflow]

Implementation

On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step.

We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:

frequent_itemsets = {}

We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset, but feel free to try different values. I recommend changing it by only 10 percent at a time, though, as the algorithm's running time will differ significantly! Let's set the minimum support:

min_support = 50

To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test whether the itemset is frequent. We use frozensets, as they allow us to perform set operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot). Let's look at the following code:

frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)

We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function and the counting dictionary:

from collections import defaultdict
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)

In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small), it is a good practice to get into for larger applications. We iterate over all of the users and their reviews:

    for user, reviews in favorable_reviews_by_users.items():

Next, we go through each of the previously discovered itemsets and see if it is a subset of the current set of reviews. If it is, this means that the user has reviewed each movie in the itemset. Let's look at the code:

        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):

We can then go through each individual movie that the user has reviewed that isn't in the itemset, create a superset by combining the itemset with that movie, and record in our counting dictionary that we saw this particular superset. Let's look at the code:

                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1

We end our function by testing which of the candidate itemsets have enough support to be considered frequent and return only those:

    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
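
As an aside, the two set operations used inside the function behave as follows on toy values (the movie IDs are made up):

reviews = frozenset({1, 2, 3, 4})  # made-up movie IDs
itemset = frozenset({1, 2})
print(reviews - itemset)           # frozenset({3, 4}): movies outside the itemset
print(itemset | frozenset((3,)))   # frozenset({1, 2, 3}): a candidate superset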

To run our code, we now create a loop that iterates over the steps of the Apriori algorithm, storing the new itemsets as we go. In this loop, k represents the length of the soon-to-be-discovered frequent itemsets, allowing us to access the most recently discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code:

import sys  # used below to flush printed output inside the loop

for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1],
                                                   min_support)
    frequent_itemsets[k] = cur_frequent_itemsets

We want to break out of the preceding loop if we didn't find any new frequent itemsets (and also print a message to let us know what is going on):

    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break

Note

We use sys.stdout.flush() to ensure that the printouts happen while the code is still running. Sometimes, particularly during long loops in notebook cells, the printouts will not happen until the code has completed. Flushing the output in this way ensures that the printout happens when we want it to. Don't do it too often though; the flush operation carries a computational cost (as does printing) and this will slow down the program.

If we do find frequent itemsets, we print out a message to let us know the loop will be running again. This algorithm can take a while to run, so it is helpful to know that the code is still running while you wait for it to complete! Let's look at the code:

    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()

Finally, after the end of the loop, we are no longer interested in the first set of itemsets: these are itemsets of length one, which won't help us create association rules, as we need at least two items to form a rule. Let's delete them:

del frequent_itemsets[1]

You can now run this code. It may take a few minutes, more if you have older hardware. If you find you are having trouble running any of the code samples, take a look at using an online cloud provider for additional speed. Details about using the cloud to do the work are given in Chapter 13, Next Steps....
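
Once the loop finishes, a quick summary shows how many itemsets of each length were found. The exact counts depend on the sampled users and the min_support value:

for length, itemsets in sorted(frequent_itemsets.items()):
    print("{} frequent itemsets of length {}".format(len(itemsets), length))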

In all, the Apriori loop returns 1,718 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because of the increasing number of possible combinations. After a while, though, the larger combinations no longer have the support necessary to be considered frequent, and the count shrinks. This shrinking is the benefit of the Apriori algorithm: if we searched all possible itemsets, rather than just the supersets of frequent ones, we would need to test thousands of times more itemsets to see if they are frequent.
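
To get a sense of the scale of those savings, we can count how many itemsets of each length a brute-force search would have to test. The figure of 300 movies below is an assumption for illustration; substitute the number of distinct movies in your own sample:

from math import factorial

def n_itemsets(n, k):
    # Number of possible itemsets of length k drawn from n items: C(n, k)
    return factorial(n) // (factorial(k) * factorial(n - k))

n_movies = 300  # assumed for illustration; use your sample's movie count
for k in (2, 3, 5, 10):
    print("length {}: {:,} possible itemsets".format(k, n_itemsets(n_movies, k)))

Even at length five this is already in the billions, while Apriori only ever tests supersets of itemsets it already knows to be frequent.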
