Association rule mining

Frequent itemsets are not very useful by themselves. The next step is to build association rules from them. Because of this final goal, the whole field of basket analysis is sometimes called association rule mining.

An association rule is a statement of the type "if X, then Y"; for example, if a customer bought War and Peace, then they will buy Anna Karenina. Note that the rule is not deterministic (not all customers who buy X will buy Y), but it is cumbersome to always spell out that if a customer bought X, they are more likely than baseline to buy Y. Thus, we say "if X, then Y", but we mean it in a probabilistic sense.

Interestingly, both the antecedent and the consequent may contain multiple objects: customers who bought X, Y, and Z also bought A, B, and C. Multiple antecedents may allow you to make more specific predictions than are possible from a single item.
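As a minimal sketch of what rules with several items on either side look like, the following snippet enumerates every way to split one itemset into a non-empty antecedent and a non-empty consequent. The function name candidate_rules is made up for this illustration, and the items are assumed to be sortable (for example, all strings or all integers).

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split of an itemset.

    Both sides are non-empty frozensets that together make up the itemset.
    """
    items = sorted(itemset)
    full = frozenset(items)
    # An antecedent of size 1 .. len-1 leaves at least one item for the consequent
    for size in range(1, len(items)):
        for chosen in combinations(items, size):
            antecedent = frozenset(chosen)
            yield antecedent, full - antecedent


rules = list(candidate_rules(frozenset(['X', 'Y', 'Z'])))
for antecedent, consequent in rules:
    print(sorted(antecedent), '->', sorted(consequent))
```

The number of splits grows exponentially with the size of the itemset, which is one reason to restrict the search, for example to single-item consequents.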

You can get from a frequent itemset to a rule by trying all the possible combinations of X implies Y. It is easy to generate many of these rules, but only some of them are valuable, so we need a way to measure the value of a rule. A commonly used measure is the lift: the ratio between the probability obtained by applying the rule and the baseline probability, as shown in the following formula:

$$\mathrm{lift}(X \rightarrow Y) = \frac{P(Y \mid X)}{P(Y)}$$

In the preceding formula, P(Y) is the fraction of all the transactions that include Y, while P(Y|X) is the fraction of transactions that include Y, given that they also include X. Using the lift helps avoid the problem of recommending bestsellers; for a bestseller, both P(Y) and P(Y|X) will be large. Therefore, the lift will be close to one and the rule will be deemed irrelevant. In practice, we wish to have values of lift of at least 10, perhaps even 100.
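To make the bestseller effect concrete, here is a small sketch that estimates the lift of a single rule directly from counts. The function name lift and the toy baskets are invented for this example; each transaction is assumed to be a frozenset of items.

```python
def lift(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) / P(consequent) from counts."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    n_total = len(transactions)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError('antecedent and consequent must each occur at least once')
    p_y = n_consequent / n_total
    p_y_given_x = n_both / n_antecedent
    return p_y_given_x / p_y


# Made-up baskets: 'bestseller' is in almost every basket, 'sequel' is rare
transactions = [frozenset(t) for t in [
    {'novel', 'sequel', 'bestseller'},
    {'novel', 'sequel', 'bestseller'},
    {'novel', 'bestseller'},
    {'bestseller'},
    {'bestseller'},
    {'bestseller'},
    {'bestseller', 'cookbook'},
    {'cookbook'},
]]

lift_sequel = lift(transactions, {'novel'}, {'sequel'})
lift_bestseller = lift(transactions, {'novel'}, {'bestseller'})
print(lift_sequel, lift_bestseller)
```

In this toy data, every basket with the novel also contains the bestseller, yet that rule's lift stays close to one because the bestseller is in nearly every basket anyway. The rarer sequel gets a clearly higher lift.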

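The following code computes the lift of every rule with a single-item consequent that can be built from the frequent itemsets, and prints the rules whose lift exceeds minlift (here, dataset is the list of transactions, each a set of items, and freqsets is the list of frequent itemsets):

minlift = 5.0
nr_transactions = float(len(dataset))
for itemset in freqsets:
    for item in itemset:
        consequent = frozenset([item])
        antecedent = itemset - consequent
        # base: transactions containing the consequent item
        base = 0.0
        # acount: transactions containing the whole antecedent
        acount = 0.0
        # ccount: transactions containing the antecedent and the consequent
        ccount = 0.0
        for d in dataset:
            if item in d:
                base += 1
            if d.issuperset(itemset):
                ccount += 1
            if d.issuperset(antecedent):
                acount += 1
        base /= nr_transactions
        p_y_given_x = ccount / acount
        lift = p_y_given_x / base
        if lift > minlift:
            print('Rule {0} -> {1} has lift {2}'
                  .format(antecedent, consequent, lift))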

Some of the results are shown in the following table. The counts are the number of transactions that include the consequent alone (that is, the base rate at which that product is bought), all the items in the antecedent, and all the items in the antecedent and the consequent:

| Antecedent | Consequent | Consequent count | Antecedent count | Antecedent and consequent count | Lift |
|---|---|---|---|---|---|
| 1378, 1379, 1380 | 1269 | 279 (0.3 percent) | 80 | 57 | 225 |
| 48, 41, 976 | 117 | 1026 (1.1 percent) | 122 | 51 | 35 |
| 48, 41, 16011 | 16010 | 1316 (1.5 percent) | 165 | 159 | 64 |

We can see, for example, that there were 80 transactions in which 1378, 1379, and 1380 were bought together. Of these, 57 also included 1269, so the estimated conditional probability is 57/80 ≈ 71 percent. Since only 0.3 percent of all transactions included 1269, this gives us a lift of 225.

The need to have a decent number of transactions in these counts in order to be able to make relatively solid inferences is why we must first select frequent itemsets. If we were to generate rules from an infrequent itemset, the counts would be very small; because of this, the relative values would be meaningless (or subject to very large error bars).
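To see how quickly the error bars grow as counts shrink, the following sketch compares an interval estimate for the same proportion at two sample sizes. It uses the Wilson score interval, which is one common choice for a binomial proportion and is used here purely as an illustration, not as part of the method described above. The counts 5 out of 7 are invented; 57 out of 80 is the conditional probability from the first rule discussed above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95 percent Wilson score interval for a proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Roughly the same proportion (about 71 percent), very different sample sizes
low_big, high_big = wilson_interval(57, 80)
low_small, high_small = wilson_interval(5, 7)
print('57 of 80: {0:.2f} to {1:.2f}'.format(low_big, high_big))
print(' 5 of 7:  {0:.2f} to {1:.2f}'.format(low_small, high_small))
```

With only seven supporting transactions, the interval is several times wider than with 80, so a lift computed from it would be correspondingly unreliable.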

Note that many more association rules are discovered from this dataset: the algorithm finds 1030 rules (requiring support for the baskets of at least 80 and a minimum lift of 5). This is still a small dataset when compared to what is now possible with the web. With datasets containing millions of transactions, you can expect to generate many thousands of rules, even millions.

However, for each customer or product, only a few rules will be relevant at any given time. So each customer only receives a small number of recommendations.
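One way to picture this filtering step is the sketch below, which keeps only the rules whose antecedent is fully contained in a customer's basket and returns the strongest few consequents. The function recommend, the (antecedent, consequent, lift) tuple format, and the max_items parameter are all assumptions made for this illustration; the two rules are taken from the results shown earlier.

```python
def recommend(basket, rules, max_items=3):
    """Return up to max_items recommended items, strongest rules first.

    rules is a list of (antecedent, consequent, lift) tuples in which the
    antecedent and consequent are frozensets (a format assumed for this sketch).
    """
    basket = frozenset(basket)
    recommended = []
    for antecedent, consequent, _ in sorted(rules, key=lambda r: r[2],
                                            reverse=True):
        if not antecedent <= basket:
            continue  # the rule does not apply to this basket
        # Only suggest items the customer does not already have
        for item in sorted(consequent - basket):
            if item not in recommended:
                recommended.append(item)
    return recommended[:max_items]


rules = [
    (frozenset([48, 41, 976]), frozenset([117]), 35.0),
    (frozenset([1378, 1379, 1380]), frozenset([1269]), 225.0),
]

one_match = recommend({48, 41, 976, 1378}, rules)
two_matches = recommend({1378, 1379, 1380, 48, 41, 976}, rules)
no_match = recommend({48, 41}, rules)
print(one_match, two_matches, no_match)
```

Even with thousands of rules in the list, a typical basket matches only a handful of antecedents, so the returned list stays short.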
