An overview of association analysis

Association analysis is a data mining technique for finding combinations of products or services that tend to be purchased together. Marketers can use this knowledge to provide recommendations, optimize product placement, or develop marketing programs that take advantage of cross-selling. In short, the idea is to identify which items go well together, and profit from this.

You can think of the results of the analysis as an if...then statement. If a customer buys an airplane ticket, then there is a 46 % probability that they'll buy a hotel room, and if they go on to buy a hotel room, then there is a 33 % probability that they'll rent a car. 

However, it isn't just for sales and marketing. It's also used in fraud detection and healthcare; for example, if a patient undergoes treatment A, then there's a 26 % probability that they'll exhibit symptom X. Before going into the details, we should have a look at some terminology, as follows:

  • Itemset: This is a collection of one or more items in the dataset.
  • Support: This is the proportion of the transactions in the data that contain an itemset of interest.
  • Confidence: This is the conditional probability that a person who purchases or does x will also purchase or do y; x is referred to as the antecedent or left-hand side (LHS), and y is the consequent or right-hand side (RHS).
  • Lift: This is the ratio of the support of x occurring together with y to the support you would expect if x and y were independent, that is, the probability of x times the probability of y. Equivalently, it's the confidence divided by the probability of y. For example, say that the probability of x and y occurring together is 10 %, the probability of x is 20 %, and the probability of y is 30 %; then the lift would be 10 % / (20 % times 30 %), or 1.67. A lift greater than 1 suggests that x and y occur together more often than they would if they were independent.
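
To make these definitions concrete, here is a minimal sketch in plain Python rather than the arules package. The five toy transactions and the function names support, confidence, and lift are assumptions made up for illustration; they are not taken from any dataset or library.

```python
# Toy transactions (hypothetical data, invented for this illustration).
transactions = [
    {"ticket", "hotel", "car"},
    {"ticket", "hotel"},
    {"ticket"},
    {"hotel", "car"},
    {"ticket", "car"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """support(lhs and rhs) / (support(lhs) * support(rhs))."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))

print("support   :", support({"ticket", "hotel"}, transactions))
print("confidence:", confidence({"ticket"}, {"hotel"}, transactions))
print("lift      :", lift({"ticket"}, {"hotel"}, transactions))

# The worked example from the lift definition: 10 % / (20 % times 30 %).
worked_example = round(0.10 / (0.20 * 0.30), 2)
print("worked example lift:", worked_example)  # 1.67
```

In this toy data, the lift of ticket and hotel comes out below 1, meaning that the two co-occur slightly less often than independence would predict.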

The package in R that you can use to perform a market basket analysis is arules: Mining Association Rules and Frequent Itemsets. The package offers two different methods for finding rules: apriori and ECLAT. There are other algorithms we can use to conduct a market basket analysis, but apriori is used most frequently, and so will be our focus.

With apriori, the principle is that, if an itemset is frequent, then all of its subsets must also be frequent. A minimum frequency (support) is determined by the analyst before executing the algorithm, and once established, the algorithm will run as follows:

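  • Let k = 1 (the number of items in an itemset)
  • Generate itemsets of length k whose support is equal to or greater than the specified minimum support
  • Iterate over k + 1, k + 2, ..., n, building candidate itemsets from the frequent itemsets of the previous pass and pruning those that are infrequent (less than the support)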
  • Stop the iteration when no new frequent itemsets are identified
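
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of the apriori idea under stated assumptions, not the implementation used by arules; the function name apriori_frequent_itemsets and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # k = 1: keep the single items that meet the minimum support.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:  # stop when a pass finds no new frequent itemsets
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count support and keep only the frequent candidates.
        next_level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                next_level[c] = s
        current = next_level
    return frequent

# Toy transactions (hypothetical data, invented for this illustration).
transactions = [
    {"ticket", "hotel", "car"},
    {"ticket", "hotel"},
    {"ticket"},
    {"hotel", "car"},
    {"ticket", "car"},
]

result = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

The pruning step is where the apriori principle pays off: a candidate is only counted against the data if every one of its smaller subsets was already found to be frequent.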

Once you have an ordered summary of the most frequent itemsets, you can continue the analysis process by examining the confidence and lift to identify the associations of interest.
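
As a rough sketch of that last step, the following plain Python turns a table of frequent itemsets into rules and ranks them by lift. The support values, the function name generate_rules, and the 0.6 confidence threshold are all assumptions chosen for illustration; the sketch also assumes that every subset of a listed itemset is itself listed, which holds for the output of apriori.

```python
from itertools import combinations

# Hypothetical supports for frequent itemsets (illustrative values only).
supports = {
    frozenset(["ticket"]): 0.8,
    frozenset(["hotel"]): 0.6,
    frozenset(["car"]): 0.6,
    frozenset(["ticket", "hotel"]): 0.4,
    frozenset(["ticket", "car"]): 0.4,
    frozenset(["hotel", "car"]): 0.4,
}

def generate_rules(supports, min_confidence):
    """Return (lhs, rhs, confidence, lift) tuples, sorted by lift descending."""
    rules = []
    for itemset, joint in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        # Try every non-empty proper subset as the left-hand side.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                rhs = itemset - lhs
                conf = joint / supports[lhs]
                if conf >= min_confidence:
                    rule_lift = conf / supports[rhs]
                    rules.append((set(lhs), set(rhs), conf, rule_lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

rules = generate_rules(supports, min_confidence=0.6)
for lhs, rhs, conf, rule_lift in rules:
    print(sorted(lhs), "=>", sorted(rhs),
          "confidence:", round(conf, 2), "lift:", round(rule_lift, 2))
```

With these made-up numbers, the rules linking hotel and car rank highest because their lift is above 1, while the rules pointing to ticket pass the confidence threshold but have a lift below 1.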

Before we delve into the analysis, it's necessary to understand how to put your raw data into the appropriate structure, referred to as R class transactions. This can be a confusing task, so I'm going to spend some time on this before moving on to a full demonstration of association analysis.
