Market basket analysis

Market basket analysis consists of modeling techniques that retailers and e-commerce marketplaces typically use to analyze shopping carts and transactions: what customers buy the most, what kinds of items they buy, when specific items sell the most, and so on. In this chapter, we will focus on item-based transactional patterns for detecting and predicting what items people are buying and are most likely to buy. Let us first look at the formal definition of market basket analysis, and then at the core concepts, metrics, and techniques tied to it. Finally, we will conclude with how to actually use these results to make data-driven decisions.

What does market basket analysis actually mean?

Market basket analysis typically encompasses several modeling techniques based upon a simple principle: if you buy a certain group of items while shopping (also known as an itemset in machine learning lingo), you are likely to buy another specific item or items along with that itemset. We analyze human shopping patterns and apply statistical techniques to generate frequent itemsets. These itemsets contain combinations of items that people are most likely to buy together, based on past shopping history.

A simple example of an itemset would be people buying beer and diapers frequently at the market. The itemset can be depicted as { beer, diapers }. A frequent itemset is an itemset which occurs more frequently than usual, as specified by a metric known as support, which we will be talking about later on. Hence, from the preceding example you can say that if I buy beer, I am also likely to buy diapers, and so diapers can be recommended to me. We can also build item association rules on top of these itemsets by analyzing shopping purchases. An example association rule can be written with itemsets using the notation { beer, diapers } → { milk }, which would indicate that if I am buying beer and diapers together, I am likely to also purchase milk along with them!

Core concepts and definitions

Now that you know what market basket analysis actually does, let us look at some definitions and concepts which are widely used in the algorithms and techniques.

Transactional datasets are databases or datasets in which customers' shopping transactions are recorded daily or weekly; each transaction consists of the items bought together by a customer. We will take an example transactional dataset which we will also be using later on in the chapter for our algorithms. Consider the following dataset, which you can also get from the shopping_transaction_log.csv file for this chapter. The data is represented in the following figure:

[Figure: the example transactional dataset from shopping_transaction_log.csv]

Each cell in the preceding dataset is also defined as an item. Items are also denoted by the symbol In where n denotes the n-th item number, and examples are enclosed in curly braces in formal definitions and when building algorithm pseudocode or doing some computations by hand. For example, cell combination (1, A) indicates item I1 whose value is depicted as { beer }.

Itemsets are defined as sets or groups of items which were bought together in any shopping transaction. Hence, these items are said to co-occur based on the transactions. We will denote itemsets as ISn where n denotes the n-th itemset number. The itemset values will be enclosed in curly braces. Each row in the preceding dataset denotes a particular transaction, and the combination of items in that row forms an itemset. The itemset IS1 is depicted by { beer, diapers, bread }.

Association rules or just rules are statements which have a left-hand side (LHS) and a right-hand side (RHS), and indicate that if we have the items on the LHS for purchase, we are likely to be interested in purchasing the RHS items too. This signifies that the itemsets are associated with each other. They are denoted as ISx → ISy, which means that if I have itemset x in my shopping cart, I will also be interested in purchasing itemset y along with it. An example rule can be { beer } → { diapers } which indicates that if I have beer in my cart, there is a chance I will buy diapers too! We will now see some metrics which determine how to measure frequent itemsets and the strength of the association rules.

The frequency of an itemset is the number of times a particular itemset occurs in the list of all transactions. Do note that the itemset can be a subset of a larger itemset in a transaction and still be counted, because the subset indicates that the specific set of items was bought along with some other products. We can denote it as f(ISn), where ISn is a particular itemset and the function f( ) gives us the frequency of that itemset in the whole transactional dataset. Taking our previous dataset, f(IS{beer, diapers}) is six, which indicates that IS{beer, diapers} has been purchased six times in total across all the transactions in our dataset.
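The frequency count described above can be sketched in a few lines of base R. The nine-transaction log below is a hypothetical stand-in, not the contents of shopping_transaction_log.csv, chosen so that it reproduces the count of six for { beer, diapers }; the name itemset_frequency is also just an illustrative choice.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# f(itemset): number of transactions containing every item of the itemset.
# A transaction with extra items still counts (the subset case in the text).
itemset_frequency <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

itemset_frequency(c("beer", "diapers"), transactions)  # 6
```

Because the check is all(itemset %in% tx), a transaction such as { beer, diapers, bread } is counted for { beer, diapers } even though it holds an extra item.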

The support of an itemset is defined as the fraction of transactions in our transactional dataset which contain that particular itemset. Basically, it means the number of times that itemset was purchased divided by the total number of transactions in the dataset. It can be denoted as $S(IS_n) = \frac{f(IS_n)}{N}$, where S( ) denotes the support of the itemset ISn and N is the total number of transactions. Taking our preceding example, S(IS{beer, diapers}) is $\frac{6}{9}$, which gives us 66.67%. The support for an association rule is similar and can be depicted as $S(IS_x \rightarrow IS_y) = \frac{f(IS_x \cap IS_y)}{N}$, where we use the intersection operator to denote the frequency of both the itemsets occurring together in the transactional dataset. The support for the rule we defined earlier, S(IS{beer} → IS{diapers}), is once again $\frac{6}{9}$ or 66.67%, because the itemset combining beer and diapers occurs six times in total, as we saw earlier. When evaluating results from association rules or frequent itemsets, the higher the support, the better it is. Support measures how well a rule or itemset describes what has already happened in past transactions.
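As a minimal sketch of the support calculation (itemset frequency divided by the total number of transactions), the following base R snippet uses a hypothetical nine-transaction log, not the chapter's actual data file, built so that { beer, diapers } occurs in six of nine transactions. The function name itemset_support is an assumption made for illustration.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# S(itemset): fraction of transactions that contain the whole itemset.
# mean() of a logical vector is the count of TRUE values divided by its length.
itemset_support <- function(itemset, transactions) {
  mean(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# Support of the rule {beer} -> {diapers} is the support of the combined itemset
round(itemset_support(c("beer", "diapers"), transactions), 4)  # 0.6667
```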

The confidence of an association rule is defined as the probability or likelihood that, for a new transaction containing the itemset on the LHS of the rule, the transaction also contains the itemset on the RHS of the rule. The confidence for a rule can be depicted as $C(IS_x \rightarrow IS_y) = \frac{S(IS_x \cap IS_y)}{S(IS_x)}$, where C( ) denotes the confidence of the rule. Do note that since the calculation of support involves dividing the itemset frequency by the total number of transactions, the total cancels out, and the right-hand side of the preceding equation ultimately reduces to the frequency of the itemsets in both the numerator and the denominator. Thus we get $C(IS_x \rightarrow IS_y) = \frac{f(IS_x \cap IS_y)}{f(IS_x)}$ as the reduced formula for getting confidence. The confidence for our earlier rule, C(IS{beer} → IS{diapers}), is $\frac{6}{6}$ or 100%, which means the probability of buying diapers, if I have beer in my shopping basket, is a hundred percent! That is pretty high, and if you go back to the dataset, you can see that it is true because every transaction involving beer also contains diapers. Thus, you can see that making predictions and recommendations is not rocket science but just simple applied math and statistical methods on top of data. Remember that confidence measures how well a rule predicts what can happen in the future, based on the past transactional data.
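The frequency-based form of confidence (how often the LHS and RHS occur together, divided by how often the LHS occurs) can be sketched as follows. The transaction log is a hypothetical nine-row example, not the chapter's data file, in which every beer transaction also contains diapers; itemset_frequency and rule_confidence are illustrative names.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# Number of transactions containing every item of the itemset
itemset_frequency <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# C(lhs -> rhs) = f(lhs and rhs together) / f(lhs)
# Returns NaN when the LHS never occurs in the log (division by zero).
rule_confidence <- function(lhs, rhs, transactions) {
  itemset_frequency(union(lhs, rhs), transactions) /
    itemset_frequency(lhs, transactions)
}

rule_confidence("beer", "diapers", transactions)  # 1
rule_confidence("diapers", "beer", transactions)  # 0.75
```

The second call shows that confidence is not symmetric: in this toy log, diapers are bought in eight transactions but only six of those include beer.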

The lift of an association rule is defined as the support of the combination of the two itemsets on the LHS and RHS together, divided by the product of the support of each of the itemsets. The lift for a rule can be depicted as $L(IS_x \rightarrow IS_y) = \frac{S(IS_x \cap IS_y)}{S(IS_x) \times S(IS_y)}$, where L( ) denotes the lift of the rule. For our example rule, L(IS{beer} → IS{diapers}) is $\frac{6/9}{(6/9) \times (8/9)}$, which evaluates to $\frac{9}{8}$, giving us the value of 1.125, which is pretty decent! The lift of a rule in general is another metric to evaluate the quality of the rule. If the lift is > 1, it indicates that the presence of the itemset on the LHS is associated with an increase in the probability that the customer is also going to buy the itemset on the RHS. This is another very important way to determine itemset associations and which items influence people to buy other items. If the lift has a value = 1, it means that the itemsets on the LHS and RHS are independent, and buying one itemset does not affect whether the customer buys the other. If the lift is < 1, it indicates that a customer who has the itemset on the LHS is less likely than usual to buy the itemset on the RHS.
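A small base R sketch of the lift calculation follows. It assumes a hypothetical nine-transaction log (not the chapter's data file) in which beer appears in six transactions, diapers in eight, and both together in six, which reproduces the 1.125 figure above; rule_lift and itemset_support are names chosen only for this illustration.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# Fraction of transactions that contain the whole itemset
itemset_support <- function(itemset, transactions) {
  mean(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# L(lhs -> rhs) = S(lhs and rhs together) / (S(lhs) * S(rhs))
rule_lift <- function(lhs, rhs, transactions) {
  itemset_support(union(lhs, rhs), transactions) /
    (itemset_support(lhs, transactions) * itemset_support(rhs, transactions))
}

rule_lift("beer", "diapers", transactions)  # 1.125, above 1
rule_lift("beer", "milk", transactions)     # 0.9, below 1
```

In this toy log, beer and milk have a lift below 1, so having beer in the basket makes milk slightly less likely than its overall rate would suggest.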

Techniques used for analysis

If you have been overwhelmed by all the mathematical information in the previous section, just relax and take a deep breath! You do not need to remember everything because most of the time, the algorithms will compute everything for you! What you do need to be good at is using these techniques and algorithms in the right way and interpreting the results to filter out what is necessary and useful. The concepts mentioned earlier will help you when you start implementing and applying the techniques later on, which we will briefly describe in this section. We will mainly be talking about three techniques, which we will be exploring in this chapter.

Evaluation of a product contingency matrix is the simplest approach to start with. It is more of a global trend-capturing mechanism, showing in a contingency matrix the top products that are being bought together. The R package arules, which we will be using later on, has a nice function called crossTable which helps in cross-tabulating the joint occurrences across pairs of items into a contingency matrix. We will use this matrix to predict which products the customers would most likely buy along with some other product from the matrix.
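To make the idea of a pairwise contingency matrix concrete, here is a hedged base R sketch that builds one by hand with crossprod on a 0/1 incidence matrix, rather than calling crossTable from arules. The transaction log is a hypothetical nine-row example, not the chapter's data file, and top_partner is an illustrative helper name.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# One row per transaction, one 0/1 column per item
items <- sort(unique(unlist(transactions)))
incidence <- t(vapply(transactions,
                      function(tx) as.integer(items %in% tx),
                      integer(length(items))))
colnames(incidence) <- items

# Item-by-item co-occurrence counts; the diagonal holds each item's own count
co_occurrence <- crossprod(incidence)

# Product most often bought together with a given product
top_partner <- function(product, co_occurrence) {
  counts <- co_occurrence[product, ]
  counts <- counts[names(counts) != product]  # ignore the diagonal entry
  names(counts)[which.max(counts)]
}

co_occurrence["beer", "diapers"]   # 6
top_partner("beer", co_occurrence) # "diapers"
```

Note that each cell relates exactly two products, which is why this approach only captures pairwise trends.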

Frequent itemset generation takes off from where the product contingency matrix stops, because the matrix has a severe limitation: it can only deal with pairs of products at any point in time. Hence, to get to itemsets which can have any number of products and detect patterns from there, we will be building our own frequent itemset generator using machine learning! Using this, we will be able to get frequent itemsets with specific support values, indicating the sets of items likely to be purchased together and hence forming the basis for recommending products to the customers.
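The following is only a brute-force sketch of what frequent itemset generation produces, not the generator built later in the chapter: it enumerates every candidate itemset up to a given size and keeps those whose support meets a threshold. The transaction log is a hypothetical nine-row example, and the names frequent_itemsets, min_support, and max_size are assumptions made for illustration.

```r
# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# Enumerate all itemsets of size 1..max_size and keep those with
# support >= min_support. Exhaustive, so only suitable for tiny examples.
frequent_itemsets <- function(transactions, min_support, max_size = 3) {
  items <- sort(unique(unlist(transactions)))
  result <- list()
  for (k in seq_len(min(max_size, length(items)))) {
    for (candidate in combn(items, k, simplify = FALSE)) {
      support <- mean(vapply(transactions,
                             function(tx) all(candidate %in% tx),
                             logical(1)))
      if (support >= min_support) {
        result[[paste(candidate, collapse = ", ")]] <- support
      }
    }
  }
  unlist(result)  # named numeric vector: itemset label -> support
}

frequent <- frequent_itemsets(transactions, min_support = 0.5)
names(frequent)  # "beer" "diapers" "milk" "beer, diapers"
```

Exhaustive enumeration grows exponentially with the number of items, which is the motivation for smarter candidate pruning in algorithms such as Apriori.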

Finally, we will be implementing association rule mining using the wonderful Apriori algorithm, which uses frequent itemsets as a part of its rule generation process. You have already seen a demo of this in Chapter 2, Let's Help Machines Learn. However, this time we will be using its full-fledged capabilities to view the association rules between product itemsets, evaluate the quality of the rules using the metrics we discussed earlier, and use these rules to make trend predictions and recommendations for products in shopping transactions.
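As a rough preview of what rule mining with Apriori looks like, the sketch below runs the apriori function from the arules package on a hypothetical nine-transaction log (not the chapter's data file). The thresholds of 0.5 for support and 0.8 for confidence are arbitrary choices for this toy data, not values taken from the chapter.

```r
library(arules)

# Hypothetical toy log (nine transactions), NOT the chapter's CSV file
transactions_list <- list(
  c("beer", "diapers", "bread"),
  c("beer", "diapers", "milk"),
  c("beer", "diapers"),
  c("beer", "diapers", "bread", "milk"),
  c("beer", "diapers", "eggs"),
  c("beer", "diapers", "milk"),
  c("diapers", "milk"),
  c("diapers", "bread"),
  c("milk", "bread")
)

# Convert the plain list into the transactions class that arules works with
trans <- as(transactions_list, "transactions")

# Mine rules with at least two items (minlen = 2 skips rules with an empty LHS)
rules <- apriori(
  trans,
  parameter = list(support = 0.5, confidence = 0.8, minlen = 2),
  control = list(verbose = FALSE)
)

# Show the rules with their support, confidence, and lift, strongest lift first
inspect(sort(rules, by = "lift"))
```

The quality measures reported by inspect are the same support, confidence, and lift metrics defined earlier, so the rules can be filtered and ranked on them.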

Making data-driven decisions

You now know what market basket analysis is, what techniques are used for it, and what results they give us. Remember that the output of market basket analysis is a set of items or products which co-occur frequently in transactions. This co-occurrence may reflect a genuinely strong association, indicated by high support, confidence, and lift, where customers tend to buy the items together, or it could simply be because the retailer has placed the items together or side by side in the store or on the website. However, do remember that strong associations do not always happen just by chance, and that is what the retailers are always trying to find out, using the techniques we talked about earlier, to boost sales.

The following are some crucial data-driven decisions which retailers usually tend to make based on the results obtained from market basket analysis:

  • Frequent itemsets containing pairs of products, such as diapers and beer, should typically be placed side by side in the store, which gives customers easy access so that they tend to buy them more.
  • Frequent itemsets which have a large number of distinct items or product counts should be placed in a specific category or theme for the itemset, such as special grocery combos or baby products. Discounts offered on the whole itemset attract more customers.
  • Association rules with a long list of items in the itemset, or products obtained from frequent itemsets or contingency matrices, can be shown as product suggestions and recommendations to customers on the product pages associated with the itemsets when they browse the shopping or e-commerce website. Care should be taken that the lift of these rules is greater than 1, as we discussed earlier.
  • Recommendation systems, targeted advertising, and marketing can all be built upon the results obtained from market basket analysis.

These decisions, if made at the right place and the right time, can help retailers immensely in boosting their sales and making good profits.

Now that we have a solid grasp of what market basket analysis actually does and how it works, we will start by building a simple algorithm for our first technique, where we make product recommendations using a product contingency matrix based on top trending products purchased in a supermarket, and then move on to building more sophisticated analyzers and recommenders using powerful machine learning capabilities of the R language.
