Chapter 16: Market Basket Analysis

Everybody is familiar with recommender systems, such as those used by Amazon to offer you new books that you haven’t read or by iTunes to offer new songs that you haven’t heard. Perhaps you have checked out the table of contents and a few pages of a recommended book and decided to buy it. Or perhaps you have listened to a 30-second clip from a recommended song, thought to yourself, “I like this,” and then bought the song. How did Amazon know that you might like that book, or how did Apple know that you might enjoy that song?

Association Analyses

Such recommendations come from association analysis (also called affinity analysis), the most common form of which is market-basket analysis. Association rules were first developed to analyze a one-time purchase of several items, such as the contents of a cart in a grocery store. The primary purpose is to determine which of the items are commonly sold together. Bread and milk commonly are sold together, but this is obvious. You are not interested in such trivial rules. The purpose of market-basket analysis is not to uncover the obvious, but to uncover relationships that were not previously known. As shown in Figure 16.1, association analysis is a method based on interdependence.

Figure 16.1: Framework for Multivariate Analysis

Examples

A grocery store, for example, might sell very few jars of an exotic mustard, a fact which, in isolation, might lead the grocery manager to discontinue this product. But suppose that a market-basket analysis reveals that this particular brand of mustard almost always is purchased with a large number of exotic, expensive items that have large profit margins. The grocery manager might then figure that, if these customers no longer can buy the mustard at his or her store, then they might go elsewhere to buy this particular brand of mustard and, while there, also buy the large number of exotic, expensive items that have large profit margins. He or she might save a few pennies by discontinuing the mustard, but then lose the purchases that go with the mustard.

Market-basket analysis can be used to generate new business via coupons and by alerting customers to products they don’t know that they want or need. Suppose that a market-basket analysis shows that peanut butter and bread are commonly bought with jelly. A customer checks out, having purchased only peanut butter and bread. The customer can then be offered a coupon for jelly.

Of course, the method can be applied more generally. A bank might observe that many customers who have a checking account and a certificate of deposit also have a credit card. Thus, the bank might craft a special credit card offer for its customers who have only a checking account and a certificate of deposit.

There is an apocryphal story of which all practitioners of market-basket analysis should be aware: A convenience store discovered that beer and diapers were frequently sold together. The explanation was that married men with infant children would stop in to buy some diapers and also pick up a six-pack for themselves. No one ever would have guessed that this relationship existed, and the store was able to exploit it by moving the diapers closer to the beer cooler. The truth is much more mundane. In a nutshell, back in the early 1990s, a statistical analysis performed by Osco Pharmacies suggested that, between 5:00 p.m. and 7:00 p.m., customers tended to buy beer and diapers together. But the store never attempted to verify the existence of the relationship or exploit it. For more detail, see Power (2002).

Understand Support, Confidence, and Lift

To motivate the basic ideas of market-basket analysis, Table 16.1 presents a toy data set that describes five hypothetical baskets comprising three or four items each.

Table 16.1: Five Small Market Baskets

Transaction Number	Item 1	Item 2	Item 3	Item 4
1	Sugar	Diapers	Bread	Beer
2	Beer	Bread	Peanut butter	Jelly
3	Bread	Peanut butter	Jelly
4	Mustard	Beer	Diapers	Jelly
5	Diapers	Peanut butter	Magazine	Beer

Each row of Table 16.1 represents a purchase of some number of items. Note that Transaction 3 contains only three items, so the fourth item is a missing value. A subset of items is called an itemset. You search for rules of the following form: IF {X} THEN {Y} where X and Y are itemsets. X is called the antecedent or condition, and Y is called the consequent.

Association Rules

Inspection of Table 16.1 yields a pair of association rules: IF {beer} THEN {diapers}; and IF {peanut butter, bread} THEN {jelly}. Of course, there are some useless rules in this data set: IF {sugar} THEN {beer} and IF {peanut butter} THEN {beer}. Clearly, any transaction data set can give rise to a seemingly countless number of rules, more than you can possibly use. So you will need some way to bring the number of rules down to a manageable level. You will see how to do so momentarily.

Support

The three primary measures of market-basket analysis are support, confidence, and lift. Support measures the proportion of transactions that satisfy a rule:

$\text{Support}(X, Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}$
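The support formula can be checked with a few lines of Python. The following is only a minimal sketch, not part of JMP: the names BASKETS and support are made up for illustration, and the baskets are the five transactions of Table 16.1 typed in by hand as sets.

```python
# Toy baskets from Table 16.1, one set of items per transaction.
# BASKETS and support() are illustrative names, not JMP functions.
BASKETS = [
    {"sugar", "diapers", "bread", "beer"},
    {"beer", "bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly"},
    {"mustard", "beer", "diapers", "jelly"},
    {"diapers", "peanut butter", "magazine", "beer"},
]


def support(x, y, baskets):
    """Proportion of baskets that contain every item of x and every item of y."""
    needed = set(x) | set(y)
    hits = sum(1 for basket in baskets if needed <= basket)
    return hits / len(baskets)


print(support({"diapers"}, {"beer"}, BASKETS))  # 0.6
```

Passing the same itemset twice, as in support(X, X), gives the proportion of baskets that contain X alone.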

In Table 16.1, if X = {diapers} and Y = {beer}, then support(diapers,beer) = 3/5 = 0.6.

Confidence

Confidence measures how accurate a rule is likely to be:

$\text{Confidence}(X, Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Number of transactions containing } X}$

In Table 16.1, if X = {diapers} and Y = {beer}, then confidence(diapers,beer) = 3/3 = 1.00. Every time you observe a transaction with diapers, that same transaction also contains beer. However, the converse is not true. If X = {beer} and Y = {diapers} then confidence(beer,diapers) = 3/4 = 0.75. So when you observe beer in a transaction, only 75% of the time does that transaction contain diapers.
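The asymmetry of confidence is easy to see in code. This is a hedged sketch in plain Python, with the Table 16.1 baskets entered by hand; count_containing and confidence are hypothetical helper names chosen for this example.

```python
# Toy baskets from Table 16.1 (illustrative, entered by hand).
BASKETS = [
    {"sugar", "diapers", "bread", "beer"},
    {"beer", "bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly"},
    {"mustard", "beer", "diapers", "jelly"},
    {"diapers", "peanut butter", "magazine", "beer"},
]


def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if set(items) <= basket)


def confidence(x, y, baskets):
    """Share of the baskets containing x that also contain y."""
    n_x = count_containing(x, baskets)
    if n_x == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return count_containing(set(x) | set(y), baskets) / n_x


print(confidence({"diapers"}, {"beer"}, BASKETS))  # 1.0
print(confidence({"beer"}, {"diapers"}, BASKETS))  # 0.75
```

Swapping antecedent and consequent changes only the denominator, which is why the two results differ.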

For more complicated rules, the interpretation is similar. Consider the rule IF {a, b} THEN {c} where a, b, and c are individual items. If the confidence for this rule is 50%, then this means that every time a and b appear in a basket, there is a 50% chance that c is in the basket also.

Lift

A problem with confidence is that, if the Y item set is very common, then the rule is not very useful. For example, if confidence(X,Y) is 60%, but Y is in 70% of the transactions, then the rule IF {X} THEN {Y} is not very useful; the rule is worse than a random guess. The concept of lift addresses this problem.

Lift measures the strength of the rule by comparing the rule to random guesses.

$\text{Lift}(X, Y) = \frac{\text{Support}(X, Y)}{\text{Support}(X, X) \times \text{Support}(Y, Y)}$

Here, support(X,X) is just the proportion of all transactions that contain X, and support(Y,Y) is the proportion of all transactions that contain Y. If X = {diapers} and Y = {beer}, then lift(diapers, beer) = 0.6/(0.6 × 0.8) = 1.25. Clearly, lift(X,Y) = lift(Y,X), so only one of these needs to be computed. A lift of 1.0 means that the rule is just as good as guessing: X and Y are independent. A lift of less than 1.0 means that the rule is worse than random guessing. Rarely is there a reason to concern yourself with rules that have lift less than unity. Another way to think about lift is to realize that it is the ratio by which the actual confidence of the rule exceeds the confidence that would occur if the items were independent.
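Both views of lift can be verified on the toy data. The sketch below is an illustration only: it works from raw transaction counts rather than rounded proportions, and the names count_containing and lift are assumptions made for this example.

```python
import math

# Toy baskets from Table 16.1 (illustrative, entered by hand).
BASKETS = [
    {"sugar", "diapers", "bread", "beer"},
    {"beer", "bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly"},
    {"mustard", "beer", "diapers", "jelly"},
    {"diapers", "peanut butter", "magazine", "beer"},
]


def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if set(items) <= basket)


def lift(x, y, baskets):
    """Support(X, Y) / (Support(X, X) * Support(Y, Y)), computed from counts."""
    n = len(baskets)
    n_x = count_containing(x, baskets)
    n_y = count_containing(y, baskets)
    n_xy = count_containing(set(x) | set(y), baskets)
    # (n_xy / n) / ((n_x / n) * (n_y / n)) simplifies to n_xy * n / (n_x * n_y).
    return (n_xy * n) / (n_x * n_y)


print(lift({"diapers"}, {"beer"}, BASKETS))  # 1.25

# Lift is symmetric in X and Y.
assert lift({"diapers"}, {"beer"}, BASKETS) == lift({"beer"}, {"diapers"}, BASKETS)

# Lift also equals confidence(X, Y) divided by the proportion of baskets with Y.
conf_xy = count_containing({"diapers", "beer"}, BASKETS) / count_containing({"diapers"}, BASKETS)
share_y = count_containing({"beer"}, BASKETS) / len(BASKETS)
assert math.isclose(lift({"diapers"}, {"beer"}, BASKETS), conf_xy / share_y)
```

The last check restates the closing remark of the paragraph above: a confidence of 1.0 against a baseline of 0.8 is an improvement by a factor of 1.25.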

Similarly, you will not be interested in rules that don’t have much support. Suppose you have a million transactions. If minimum support is 1%, then only rules supported by at least 10,000 transactions will be considered. This is the primary way that you cut down on the number of rules to analyze. Rules that fall below the minimum support threshold simply don’t have enough observations to be worth considering. It is important to realize, and this is obvious from looking at the equation for lift, that when the individual supports in the denominator get smaller, lift tends to get larger. If minimum support is reduced from 0.01 to 0.001, lift can get very large. In general, rules with high lift have low support.

Association analysis works best when items are equally prevalent across the baskets. That way an extremely common item doesn’t dominate the results. It might be useful to limit a market-basket analysis to a subgroup, such as vegetables, rather than across groups, such as dairy and vegetables, where milk would be the dominant item and make it hard to uncover the relationships in the vegetables subgroup.

Use JMP to Calculate Confidence and Lift

Now you will apply JMP to Table 16.1 and try to reproduce a couple of the calculations from the previous section—specifically, confidence(diapers,beer) = 1.0, and lift(diapers,beer) = 1.25; and the converse, confidence(beer,diapers) = 0.75, and lift(beer,diapers) = 1.25.

The file ToyAssociation.jmp re-creates Table 16.1; open it. Remember that data are not always organized the way that the software wants, and here is a case in point. The association analysis procedure in JMP requires two columns of data: one for the ID (in this case, the transaction number), and another column of items. The data in Table 16.1 need to be stacked before you can use them:

1. Select Tables ▶ Stack and stack the items, omitting the Transaction Number.

2. Select Items 1 through 4 and click Stack Columns. Click OK. The resulting data table has three columns: Transaction Number, Label, and Data. You can ignore Label. Select Analyze ▶ Screening ▶ Association Analysis.

3. Select Transaction Number and click ID.

4. Select Data and click Item. Leave everything else at default. Click OK.

Look at the first two lines of the Association Analysis window, which reproduce the results that you calculated above: confidence(beer,diapers) = 75%, confidence(diapers,beer) = 100%, and lift(beer,diapers) = lift(diapers,beer) = 1.25.
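The stacking step is the part of this workflow that most often trips people up, so here is a rough Python analogue of what Tables ▶ Stack does to Table 16.1. It is a sketch under stated assumptions: WIDE_ROWS and stack are invented names, the missing fourth item of Transaction 3 is represented as None, and the sketch drops that missing cell, which may differ from what JMP keeps in its stacked table.

```python
# Table 16.1 in its original wide layout: (transaction number, [Item 1..Item 4]).
# None marks the missing fourth item in Transaction 3.
WIDE_ROWS = [
    (1, ["Sugar", "Diapers", "Bread", "Beer"]),
    (2, ["Beer", "Bread", "Peanut butter", "Jelly"]),
    (3, ["Bread", "Peanut butter", "Jelly", None]),
    (4, ["Mustard", "Beer", "Diapers", "Jelly"]),
    (5, ["Diapers", "Peanut butter", "Magazine", "Beer"]),
]


def stack(wide_rows):
    """Turn one row per basket into one row per purchased item.

    Each output row is (Transaction Number, Label, Data), mirroring the
    three columns described in the text. Missing items are skipped here.
    """
    long_rows = []
    for transaction, items in wide_rows:
        for position, item in enumerate(items, start=1):
            if item is not None:
                long_rows.append((transaction, f"Item {position}", item))
    return long_rows


stacked = stack(WIDE_ROWS)
for row in stacked[:4]:
    print(row)
```

Only the first and third columns matter for the analysis: the transaction number plays the role of ID and the item name plays the role of Item.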

Use the A Priori Algorithm for More Complex Data Sets

For this toy set, it was easy for the computer to generate all the possible itemsets and analyze them. Suppose, however, that your data set had n = 5 items. How many itemsets might there be? (2 ^ 5) − 1 = 31. Why the “−1”? Imagine using 0 for not buy and 1 for buy. There are 2 ^ 5 = 32 possible sequences of 0s and 1s, including the sequence 0 0 0 0 0, which represents no purchases at all. So you have to subtract that one sequence. If n = 100, then there are (2 ^ 100) − 1 possible itemsets, which is about 1.27 × 10^30. A typical grocery store has 40,000 items, and the number of possible itemsets is too large for any computer to enumerate. You need some way to pare this number down to a manageable size so that a computer can handle it. There are many algorithms that do so, but the most popular is the a priori algorithm.
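The counting argument can be confirmed by brute force for small n. The sketch below uses Python's itertools.combinations; the function names count_itemsets and all_itemsets are illustrative only.

```python
from itertools import combinations


def count_itemsets(n):
    """Number of nonempty itemsets that can be formed from n distinct items."""
    return 2 ** n - 1


def all_itemsets(items):
    """Explicitly list every nonempty itemset. Feasible only for small n."""
    items = sorted(items)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


print(count_itemsets(5))               # 31
print(len(all_itemsets("abcde")))      # 31, matches the formula
print(len(str(count_itemsets(100))))   # 31 digits, far too many to enumerate
```

Listing itemsets explicitly works for five items and is hopeless for a hundred, which is the motivation for pruning by minimum support.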

The a priori algorithm begins with the idea that you will consider only itemsets that have some minimum level of support, which might be 1% or might be 10%. These are called frequent itemsets. Next, you generate all frequent itemsets that contain just one item. The key insight that drives the a priori algorithm is this: A frequent itemset with two items can be built only from items that are themselves frequent itemsets of one item. Suppose drain cleaner appears in fewer than 1% of baskets. Then it is not a frequent itemset of one item, and it necessarily cannot be part of any frequent itemset that contains two items. Similarly, frequent itemsets with three items are derived from frequent itemsets with two items, and so on.
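The level-by-level idea can be sketched in a few dozen lines. What follows is a simplified teaching version of the a priori approach applied to the Table 16.1 baskets, not the implementation that JMP uses; the function name frequent_itemsets and its structure are assumptions for illustration, and no attention is paid to efficiency.

```python
from itertools import combinations

# Toy baskets from Table 16.1 (illustrative, entered by hand).
BASKETS = [
    {"sugar", "diapers", "bread", "beer"},
    {"beer", "bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly"},
    {"mustard", "beer", "diapers", "jelly"},
    {"diapers", "peanut butter", "magazine", "beer"},
]


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for basket in baskets if itemset <= basket) / n

    # Level 1: single items that meet the minimum support.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Join step: merge frequent itemsets of the previous level.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every smaller subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        # One pass over the data per level to keep the frequent candidates.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


result = frequent_itemsets(BASKETS, min_support=0.6)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With a minimum support of 0.6, sugar, mustard, and magazine are discarded at the first level, so no pair containing them is ever counted.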

Form Rules and Calculate Confidence and Lift

Generating each level of frequent itemset requires only one pass through the data. Once all the frequent itemsets are generated, rules can be easily formed, and these rules can have confidence and lift calculated. Minimum levels of confidence and lift can be applied to filter out weak rules, for you are really interested in strong rules that have high confidence and good lift.
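To make the rule-forming step concrete, the sketch below splits each frequent itemset into every possible antecedent and consequent, then keeps the rules that pass minimum confidence and lift. It is a hedged illustration on the Table 16.1 baskets: the frequent itemsets are found by brute force rather than level by level, and strong_rules and its threshold parameters are names invented for this example.

```python
from itertools import combinations

# Toy baskets from Table 16.1 (illustrative, entered by hand).
BASKETS = [
    {"sugar", "diapers", "bread", "beer"},
    {"beer", "bread", "peanut butter", "jelly"},
    {"bread", "peanut butter", "jelly"},
    {"mustard", "beer", "diapers", "jelly"},
    {"diapers", "peanut butter", "magazine", "beer"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def strong_rules(baskets, min_support, min_confidence, min_lift):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            whole = frozenset(combo)
            n_whole = count_containing(whole, baskets)
            if n_whole == 0 or n_whole / n < min_support:
                continue  # not a frequent itemset, so no rules from it
            # Every nonempty proper subset can serve as the antecedent X.
            for k in range(1, size):
                for antecedent in combinations(combo, k):
                    x = frozenset(antecedent)
                    y = whole - x
                    n_x = count_containing(x, baskets)
                    n_y = count_containing(y, baskets)
                    conf = n_whole / n_x
                    lift = (n_whole * n) / (n_x * n_y)
                    if conf >= min_confidence and lift >= min_lift:
                        rules.append((x, y, n_whole / n, conf, lift))
    return rules


found = strong_rules(BASKETS, min_support=0.4, min_confidence=0.9, min_lift=1.0)
for x, y, sup, conf, lift in found:
    print(f"IF {sorted(x)} THEN {sorted(y)}: "
          f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

On the toy data these thresholds keep IF {diapers} THEN {beer} and drop the converse, whose confidence is only 0.75.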

Nonetheless, not all rules with high confidence and good lift will be useful. These rules will generally fall into one of three categories: actionable, trivial (useless), and inexplicable; with actionable rules being the rarest.

● An example of a useless rule is this: Persons who purchase maintenance agreements usually purchase large appliances.

● An example of an inexplicable rule is this: Persons who buy toilet cleaner also buy bread.

● An actionable rule might be this: Persons who buy bread and peanut butter also buy jelly.

You can then give jelly coupons with a recipe for peanut butter and jelly sandwiches on the back to persons who buy just bread and peanut butter, in an attempt to convert them to regular jelly purchasers.

In practice, you need to play with the minimum support and confidence levels to find as many interesting rules as possible without including too many trivial or inexplicable rules.

Analyze a Real Data Set

Open the data file GroceryPurchases.jmp, which is due to the efforts of Brijs et al. (1999). Observe that it is already stacked: Each row contains a single purchased item and an ID indicating from which basket the item came. This file has 7007 rows, and the ID numbers run from zero to 1000, indicating 1001 baskets with an average of seven items per basket.

Perform Association Analysis with Default Settings

Run Association Analysis with the default settings:

1. Select Analyze ▶ Screening ▶ Association Analysis.

2. Select Customer ID, and click ID.

3. Select Product and click Item. Leave everything else at default. Click OK.

The first thing to observe about the Rules in the output window is that there are far too many of them to comprehend.

Reduce the Number of Rules and Sort Them

Reduce the size of the list of rules:

1. Click the red triangle next to Association Analysis at the top of the output window.

2. Select Redo ▶ Relaunch Analysis.

3. Change Minimum Confidence to 0.9 and Minimum Lift to 2.0.

4. Click OK.

The list is appreciably shorter. The next thing to observe is that the list of Rules is not sorted in any meaningful way. To remedy this, right-click Confidence and then click Sort by Column ▶ Confidence. For follow-up visualization, you can also turn this list into a data table and then make a scatter plot:

1. Right-click over the text/numbers of the Rules, and select Make Into Data Table.

2. You can right-click the column name Confidence, select Sort, and then Descending. See that many rules have very high confidence.

3. Right-click the column name Lift, select Sort and then Descending. See that many rules have very high lift.

Examine Results

To view both Confidence and Lift at once, use a scatter plot:

1. On the data table, select Graph ▶ Graph Builder.

2. Drag Confidence to Y and Lift to X.

3. The scatterplot might have a smoothing line through it. This is a distraction that is not appropriate for visualizing these data. If you see the smoothing line, click the second from left graph box at the top of the window to remove it and leave the points. (If the leftmost box is shaded, that means “show the points,” and if the second box is shaded, that means “show the smoother.”)

You see that about half the points have Confidence between 94% and 100% with a Lift between 2 and 3 as shown in Figure 16.2.

Figure 16.2: Confidence versus Lift for the GroceryPurchases Data

Return to the Association Analysis output window. Click the sideways triangle next to Frequent Item Sets. You see immediately that the customers of this store are quite fond of Heineken beer; it is in 60% of the baskets. You also see that the customers are fond of crackers, herring, and olives, as shown in Figure 16.3. Scanning down the list, you see that Coke has 30% support. Perhaps you are interested in increasing Coke sales. Two relevant questions are as follows:

1. What are customers likely to buy before buying Coke?

2. What are customers likely to buy if they purchase Coke?

Figure 16.3: Frequent Item Sets

Looking at your list of Rules with a Minimum Confidence of 0.9, you see that there is not much scope for improvement. Everybody here who is already buying Coke or going to buy Coke is doing so with a very high probability.

Target Results to Take Business Actions

Rerun the analysis, this time with a Minimum Confidence of 0.6. Right away, in the first two Rules, you see two opportunities, both with Confidence of about 70%: Coke => ice cream and ice cream => Coke. There is a strong tendency for Coke and ice cream to be purchased together, but there are still plenty of people who buy one but not the other. So there are many potential customers. For people who already buy Coke but not ice cream, you can issue a coupon for ice cream at check-out. You can also place an advertisement for ice cream in the Coke aisle. For persons who buy ice cream but not Coke, you issue coupons at check-out and place a Coke advertisement near the ice cream cooler.

The scope for association analysis is quite wide. In addition to analyzing market baskets, it has been used for the following:

● Analyzing a mailing list of donors to find out what characteristics lead to a donation

● Comparing different stores in a chain to uncover different selling patterns at different locations

● Conducting computer network forensics to detect and deter attacks by developing a signature pattern of an attack

● Finding biologically important relationships between different genes

● Determining which customers of a health insurance company are at risk of developing diseases

Exercises

1. Identify other opportunities for increasing Coke sales, or for using Coke to spur sales of other goods.

2. For the data in Table 16.1, compute Confidence and Lift for X = {bread} and Y = {jelly, peanut butter} by hand. Then check your answer by using JMP.

3. Calculate the Support of X = {bread} and Y = {jelly, peanut butter}.

4. Analyze the GroceryPurchases.jmp data and find some actionable rules.
