We will do two things here. First, we will analyze a small toy dataset from a supermarket using a product contingency matrix built from the purchase frequency of product pairs. Then, using another dataset, we will move on to contingency matrices based on other metrics such as support and lift.
The data for our first matrix consists of the six most popular products sold at the supermarket and also the number of times each product was sold by itself and in combination with the other products. We have the data in the form of a data table captured in a csv file, as you can see in the following figure:
To analyze this data, we first need to understand what it depicts. Each cell value denotes the number of times that product combination was sold. Thus, the cell (1, A) denotes the product combination (milk, milk), which is simply the number of times milk was bought. Another example is the cell (4, C), which is analogous to the cell (3, D): both indicate the number of times bread was bought along with butter. Such a table is often known as a contingency matrix, and in our case it is a product contingency matrix since it deals with product data. Let us follow our standard machine learning pipeline of getting the data, analyzing it, running it on our algorithm, and getting the intended results.
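To make the structure of such a matrix concrete, here is a minimal sketch of how pair counts could be derived from raw transactions in base R. The four toy transactions and the names `transactions`, `items`, `incidence`, and `pair_counts` are hypothetical illustrations, not the supermarket data used in this section.

```r
# Hypothetical toy transactions (NOT the supermarket dataset from this section)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk")
)
items <- c("milk", "bread", "butter")

# Binary incidence matrix: one row per transaction, one column per item
incidence <- t(sapply(transactions, function(tr) as.integer(items %in% tr)))
colnames(incidence) <- items

# t(incidence) %*% incidence counts, for every pair of items, the number of
# transactions that contain both; the diagonal holds each item's own count
pair_counts <- crossprod(incidence)
print(pair_counts)
```

The result is symmetric, which is why a cell and its mirror image across the diagonal describe the same product pair.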
Here, we will first load the dataset into memory from the disk using the following code snippet. Remember to have the top_supermarket_transactions.csv file in the same directory from which you run the following code snippet, which is also available in the file named ch3_product contingency matrix.R along with this book.
> # reading in the dataset
> data <- read.csv("top_supermarket_transactions.csv")
>
> # assigning row names to be same as column names
> # to build the contingency matrix
> row.names(data) <- data[[1]]
> data <- subset(data, select = c(-1))
>
> ## viewing the contingency matrix
> cat("Products Transactions Contingency Matrix")
Products Transactions Contingency Matrix
> data
Output:
Here, we will do some exploratory analysis of the dataset to see what kind of story the data tells us. For that, we will first look at the transactions related to buying milk and bread in the following code snippet:
> ## Analyzing and visualizing the data
> # Frequency of products bought with milk
> data['milk', ]
      milk bread butter beer wine diapers
milk 10000  8758   5241  300  215     753
>
> # Sorting to get top products bought with milk
> sort(data['milk', ], decreasing = TRUE)
      milk bread butter diapers beer wine
milk 10000  8758   5241     753  300  215
>
> # Frequency of products bought with bread
> data['bread', ]
      milk bread butter beer wine diapers
bread 8758  9562   8865  427  322     353
>
> # Sorting to get top products bought with bread
> sort(data['bread', ], decreasing = TRUE)
      bread butter milk beer diapers wine
bread  9562   8865 8758  427     353  322
Thus, you can see that just by sorting the values in a product's row, we are able to see the top products that were bought in combination with bread or with milk. Note that the largest value in each row belongs to the product itself. When recommending top products to buy from the matrix, we will remove a product from the recommendation list if it is already in the shopping cart because, if I buy bread, it makes no sense to recommend bread to me. Now, we will visualize the complete dataset using a mosaic plot. Do note that the product combinations that were bought very frequently will have high frequency values and will be indicated by a significant area in the mosaic plot.
> # Visualizing the data
> mosaicplot(as.matrix(data),
+            color=TRUE,
+            main="Products Contingency Mosaic Plot",
+            las=2
+ )
The code generates the following mosaic plot where we apply a gradient using the color parameter and specify that axis labels be at right angles to the axis using the las parameter to make a cleaner plot.
From the preceding plot you can note that it is now very easy to see which products were bought a large number of times in combination with another product. Ignoring the same product row and column values, we can easily deduce that product combinations such as beer and diapers were bought very frequently!
The beer and diapers combination has a well-known background story. It is usually attributed to Walmart, which is said to have analyzed customer transactional data some time back and found that, on Fridays, young American dads tended to buy beer and diapers together. They would celebrate the weekend with their friends but, having fathered an offspring, they also carried out the essential duty of taking care of their children's needs. As the story goes, Walmart then placed beer and diapers side by side in stores and their sales went up significantly! This illustrates the power of analytics and machine learning, which enable us to find unknown and unexpected patterns.
Now we will recommend products based on the product chosen by a customer in their shopping cart. Do note that we refer to these as global recommendations because they are based neither on association rules nor on frequent itemsets, both of which we will explore after this. They are purely based on the global product contingency matrix of product pair purchase counts. The following code snippet enables us to recommend the top two suggested products for each item from the matrix:
## Global Recommendations
cat("Recommendations based on global products contingency matrix")
items <- names(data)
for (item in items){
  cat(paste("Top 2 recommended items to buy with", item, "are: "))
  # drop the item itself so it is never recommended alongside itself
  item.data <- subset(data[item,], select=names(data)[!names(data) %in% item])
  # names of the two largest remaining counts in this item's row
  cat(names(item.data[order(item.data, decreasing = TRUE)][c(1,2)]))
  cat("\n")
}
This gives us the following output:
Top 2 recommended items to buy with milk are: bread butter
Top 2 recommended items to buy with bread are: butter milk
Top 2 recommended items to buy with butter are: bread milk
Top 2 recommended items to buy with beer are: wine diapers
Top 2 recommended items to buy with wine are: beer butter
Top 2 recommended items to buy with diapers are: beer milk
Thus you can see that, based on the product pair purchases from the contingency matrix, we get the top two products which people would tend to buy, based on the global trends captured in that matrix. Now we will look at some more ways to generate more advanced contingency matrices based on some other metrics.
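The same idea can be wrapped in a small reusable function. The following is a minimal sketch in base R; the helper name `recommend_top_n` and its arguments are hypothetical, and the `counts` matrix below contains only the milk and bread rows shown earlier rather than the full dataset.

```r
# Hypothetical helper: top-n co-purchased items for a single product.
# Assumes `counts` is a matrix or data frame of pair counts whose row and
# column names are product names.
recommend_top_n <- function(counts, item, n = 2) {
  row <- unlist(counts[item, , drop = TRUE])  # named numeric vector
  row <- row[names(row) != item]              # never recommend the item itself
  names(sort(row, decreasing = TRUE))[seq_len(min(n, length(row)))]
}

# Only the milk and bread rows, copied from the output shown earlier
counts <- rbind(
  milk  = c(milk = 10000, bread = 8758, butter = 5241,
            beer = 300, wine = 215, diapers = 753),
  bread = c(milk = 8758, bread = 9562, butter = 8865,
            beer = 427, wine = 322, diapers = 353)
)

print(recommend_top_n(counts, "milk"))
print(recommend_top_n(counts, "bread"))
```

Converting the row to a plain named vector before sorting avoids relying on how a one-row data frame is ordered, and the `n` argument makes it easy to ask for more than two suggestions.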
Until now we have just used product contingency matrices based on product purchase frequencies. We will now look at creating some more contingency matrices using metrics such as support and lift, which we talked about earlier, since they are better indicators for items which have a probability of being purchased together by customers when shopping. For this we will be using the package arules available in the Comprehensive R Archive Network (CRAN) repositories. You can download it if not present using the install.packages('arules') command. Once it is installed, we will look at a standard grocery-based transactional log database and build the contingency matrices using the standard machine learning methodology that we used in the previous chapters to work on any dataset or problem.
First, we will start by loading the required package and the data into our workspace and looking at what the transactional data looks like:
> # loading the required package
> library(arules)
>
> # getting and loading the data
> data(Groceries)
>
> # inspecting the first 3 transactions
> inspect(Groceries[1:3])
  items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
Each preceding transaction is a set of products which were purchased together, just as we had discussed in the previous sections. We will now build several contingency matrices based on different metrics and view the top five product pairs which customers would be interested in buying together. The following code snippet shows us a count-based product contingency matrix:
> # count based product contingency matrix
> ct <- crossTable(Groceries, measure="count", sort=TRUE)
> ct[1:5, 1:5]
Output:
Here we see a similar matrix to what we had worked with earlier. Now we will create a support based product contingency matrix:
> # support based product contingency matrix
> ct <- crossTable(Groceries, measure="support", sort=TRUE)
> ct[1:5, 1:5]
Output:
Finally, we look at another matrix based on the lift metric, which we discussed earlier. Recall that a lift value greater than 1 indicates that two products are bought together more often than would be expected if they were purchased independently, and the higher the value, the stronger the association.
> # lift based product contingency matrix
> ct <- crossTable(Groceries, measure="lift", sort=TRUE)
> ct[1:5, 1:5]
Output:
From the preceding matrix, you can gain insights such as the fact that people tend to buy yogurt and whole milk together, or that soda and whole milk do not really go together since that pair has a lift value less than 1. These kinds of insights help in planning product placement in stores and shopping websites for better sales and recommendations.
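The relationship between the three measures can also be sketched by hand in base R. Assuming a pair-count matrix whose diagonal holds the single-item counts and a known total number of transactions, support is the count divided by that total, and lift divides the joint support by the product of the two individual supports. The toy numbers and the names `pair_counts` and `n_transactions` are hypothetical, and this illustrates the definitions rather than the internals of crossTable.

```r
# Hypothetical pair counts over 10 transactions (diagonal = item counts)
items <- c("milk", "bread", "soda")
pair_counts <- matrix(
  c(6, 5, 1,
    5, 7, 2,
    1, 2, 4),
  nrow = 3, byrow = TRUE, dimnames = list(items, items)
)
n_transactions <- 10

# Support: fraction of all transactions that contain the item or item pair
support <- pair_counts / n_transactions

# Lift: joint support relative to what independence would predict,
# supp(A, B) / (supp(A) * supp(B))
item_support <- diag(support)
lift <- support / outer(item_support, item_support)
diag(lift) <- NA  # the lift of an item with itself is not informative

print(round(lift, 2))
```

In this toy example the milk and bread pair ends up with a lift above 1, while both pairs involving soda fall below 1, mirroring the kind of reading made from the Groceries matrix above.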
However, some of the main issues with this model are as follows: