Evaluating a product contingency matrix

We will do two things here. First, we will analyze a small toy dataset from a supermarket using a product contingency matrix of product pair purchase frequencies. Then, using another dataset, we will move on to contingency matrices based on other metrics such as support and lift.

The data for our first matrix consists of the six most popular products sold at the supermarket, along with the number of times each product was sold and the number of times it was sold in combination with each of the other products. The data is stored as a table in a CSV file, as you can see in the following figure:

[Figure: the supermarket product contingency data table]

To analyze this data, we first need to understand what it depicts. Each cell value denotes the number of times that product combination was sold. Thus, the cell (1, A) denotes the product combination (milk, milk), which is simply the number of times milk was bought. Similarly, the cell (4, C) is equivalent to the cell (3, D); both indicate the number of times bread was bought along with butter, so the matrix is symmetric. Such a table is known as a contingency matrix, and in our case it is a product contingency matrix since it deals with product data. Let us follow our standard machine learning pipeline of getting the data, analyzing it, running our algorithm on it, and getting the intended results.
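To make the meaning of the cells concrete, here is a minimal sketch of how such a matrix could be derived from raw transactions. The four transactions and the three product names are hypothetical toy values chosen for illustration; they are not the supermarket dataset used in this section.

```r
# Hypothetical toy transactions (NOT the supermarket data from this section)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk")
)
products <- c("milk", "bread", "butter")

# Binary incidence matrix: one row per transaction, one column per product
incidence <- t(sapply(transactions,
                      function(tr) as.integer(products %in% tr)))
colnames(incidence) <- products

# t(incidence) %*% incidence counts, for every product pair, the number of
# transactions containing both; the diagonal holds single-product counts
contingency <- crossprod(incidence)
contingency
```

The result is symmetric, and each diagonal cell holds the number of transactions containing that product, which mirrors how the (milk, milk) cell is read above.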

Getting the data

Here, we first load the dataset from disk into memory using the following code snippet. Remember to keep the top_supermarket_transactions.csv file in the directory from which you run the snippet. The code is also available in the file named ch3_product contingency matrix.R that accompanies this book.

> # reading in the dataset
> data <- read.csv("top_supermarket_transactions.csv")
> 
> # assigning row names to be same as column names
> # to build the contingency matrix
> row.names(data) <- data[[1]]
> data <- subset(data, select = c(-1))
>
> ## viewing the contingency matrix
> cat("Products Transactions Contingency Matrix")
Products Transactions Contingency Matrix
> data

Output:

[Figure: the products transactions contingency matrix printed by the preceding snippet]

Analyzing and visualizing the data

Here, we will do some exploratory analysis of the dataset to see what kind of story the data tells us. For that, we will first look at the transactions related to buying milk and bread in the following code snippet:

> ## Analyzing and visualizing the data
> # Frequency of products bought with milk
> data['milk', ]
      milk bread butter beer wine diapers
milk 10000  8758   5241  300  215     753
> 
> # Sorting to get top products bought with milk
> sort(data['milk', ], decreasing = TRUE)
      milk bread butter diapers beer wine
milk 10000  8758   5241     753  300  215
> 
> # Frequency of products bought with bread
> data['bread', ]
      milk bread butter beer wine diapers
bread 8758  9562   8865  427  322     353
> 
> # Sorting to get top products bought with bread
> sort(data['bread', ], decreasing = TRUE)
      bread butter milk beer diapers wine
bread  9562   8865 8758  427     353  322

Thus, just by sorting the values in a product's row, we can see the top products which were bought in combination with bread or with milk. Note that the product itself always comes first, since its own count is the largest value in its row. When recommending top products from the matrix, we will therefore remove any product that is already in the shopping cart from the recommendation list: if a customer buys bread, it makes no sense to recommend bread to them. Now, we will visualize the complete dataset using a mosaic plot. Product combinations which were bought very frequently have high frequency values and are indicated by a large area in the mosaic plot.

> # Visualizing the data
> mosaicplot(as.matrix(data), 
+            color=TRUE, 
+            main="Products Contingency Mosaic Plot",
+            las=2
+            )

The code generates the following mosaic plot. We apply a color gradient using the color parameter, and we use the las parameter to place the axis labels at right angles to the axis, which makes for a cleaner plot.

[Figure: Products Contingency Mosaic Plot]

From the preceding plot, it is now very easy to see which products were bought a large number of times in combination with another product. Ignoring the cells where a product is paired with itself, we can easily deduce that product combinations such as beer and diapers were bought very frequently!

Note

The story behind our beer and diapers combination is that it was reportedly discovered by Walmart some time back, when they analyzed customer transactional data and found that, on Fridays, young American dads tend to buy beer and diapers together. They would celebrate the weekend with their friends but, being new fathers, they also carried out the essential duty of taking care of their children's needs. As the story goes, Walmart placed beer and diapers side by side in stores and their sales went up significantly! This is the power of analytics and machine learning, which enables us to find unknown and unexpected patterns.

Global recommendations

Now we will recommend products based on the product a customer has placed in their shopping cart. We call these global recommendations because they are based neither on association rules nor on frequent itemsets, both of which we will explore after this. They are based purely on the global product contingency matrix of product pair purchase counts. The following code snippet recommends the top two suggested products for each item in the matrix:

## Global Recommendations
cat("Recommendations based on global products contingency matrix\n")
items <- names(data)
for (item in items){
  cat(paste("Top 2 recommended items to buy with", item, "are: "))
  # drop the item itself so it is never recommended for its own cart
  item.data <- subset(data[item,], select=names(data)[!names(data) %in% item])
  # unlist() turns the one-row data frame into a plain vector before ordering
  cat(names(item.data[order(unlist(item.data), decreasing = TRUE)][c(1,2)]))
  cat("\n")
}

This gives us the following output:

Top 2 recommended items to buy with milk are: bread butter
Top 2 recommended items to buy with bread are: butter milk
Top 2 recommended items to buy with butter are: bread milk
Top 2 recommended items to buy with beer are: wine diapers
Top 2 recommended items to buy with wine are: beer butter
Top 2 recommended items to buy with diapers are: beer milk

Thus you can see that, based on the product pair purchases from the contingency matrix, we get the top two products which people would tend to buy, based on the global trends captured in that matrix. Now we will look at some more ways to generate more advanced contingency matrices based on some other metrics.
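The same idea can be wrapped in a small reusable function. The following is only a sketch: the helper name top_n_recommendations and its n parameter are hypothetical, and the counts matrix contains just the milk and bread rows copied from the output shown earlier rather than the full dataset.

```r
# Two rows of the contingency matrix, copied from the output shown earlier
counts <- rbind(
  milk  = c(milk = 10000, bread = 8758, butter = 5241,
            beer = 300, wine = 215, diapers = 753),
  bread = c(milk = 8758, bread = 9562, butter = 8865,
            beer = 427, wine = 322, diapers = 353)
)

# Hypothetical helper: names of the n products most often bought with `item`
top_n_recommendations <- function(mat, item, n = 2) {
  row <- mat[item, ]
  row <- row[names(row) != item]   # never recommend the item itself
  names(sort(row, decreasing = TRUE))[seq_len(min(n, length(row)))]
}

top_n_recommendations(counts, "milk")
top_n_recommendations(counts, "bread", n = 3)
```

Excluding the item before sorting matters because its own count is always the largest value in its row.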

Advanced contingency matrices

Until now, we have only used product contingency matrices based on product purchase frequencies. We will now create some more contingency matrices using metrics such as support and lift, which we talked about earlier, since they are better indicators of which items are likely to be purchased together by customers. For this, we will use the arules package, available from the Comprehensive R Archive Network (CRAN) repositories. If it is not already present, you can install it using the install.packages('arules') command. Once it is installed, we will look at a standard grocery transaction log database and build the contingency matrices using the standard machine learning methodology that we used in the previous chapters to work on any dataset or problem.

First, we will start by loading the required package and the data into our workspace and looking at what the transactional data looks like:

> # loading the required package
> library(arules)
> 
> # getting and loading the data
> data(Groceries)
> 
> # inspecting the first 3 transactions 
> inspect(Groceries[1:3])
  items                                                   
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}                          
3 {whole milk}

Each of the preceding transactions is a set of products which were purchased together, just as we discussed in the previous sections. We will now build several contingency matrices based on different metrics and view the section of each matrix covering the top five products, which shows the pairs customers would be interested in buying together. The following code snippet shows us a count based product contingency matrix:

> # count based product contingency matrix 
> ct <- crossTable(Groceries, measure="count", sort=TRUE)
> ct[1:5, 1:5]

Output:

[Figure: first five rows and columns of the count based product contingency matrix]

Here we see a similar matrix to what we had worked with earlier. Now we will create a support based product contingency matrix:

> # support based product contingency matrix 
> ct <- crossTable(Groceries, measure="support", sort=TRUE)
> ct[1:5, 1:5]

Output:

[Figure: first five rows and columns of the support based product contingency matrix]

Finally, we look at another matrix based on the lift metric, which we discussed earlier. If you remember, a lift value greater than 1 indicates that two products are bought together more often than we would expect if they were purchased independently, and the higher the value, the stronger the association.

> # lift based product contingency matrix 
> ct <- crossTable(Groceries, measure="lift", sort=TRUE)
> ct[1:5, 1:5]

Output:

[Figure: first five rows and columns of the lift based product contingency matrix]

From the preceding matrix, you can gain insights such as that people tend to buy yogurt and whole milk together, or that soda and whole milk do not really go together, since that pair has a lift value of less than 1. These kinds of insights help in planning product placement in stores and on shopping websites for better sales and recommendations.
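To see where these numbers come from, support and lift can be computed by hand from a count based matrix, using support(A, B) = count(A, B) / N and lift(A, B) = support(A, B) / (support(A) * support(B)). The following base R sketch uses hypothetical toy counts for three products over four transactions; it does not use the Groceries data and is not how arules computes these values internally.

```r
# Hypothetical pair counts for three products (toy numbers, NOT Groceries)
n.transactions <- 4
counts <- matrix(
  c(3, 2, 1,
    2, 3, 2,
    1, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("milk", "bread", "butter"),
                  c("milk", "bread", "butter"))
)

# support: fraction of all transactions that contain the pair;
# the diagonal holds the support of each single product
support <- counts / n.transactions

# lift: observed pair support divided by the support we would expect
# if the two products were bought independently of each other
item.support <- diag(support)
lift <- support / outer(item.support, item.support)
diag(lift) <- NA   # a product paired with itself is not informative

round(lift, 2)
```

In this toy example, bread and butter have a lift above 1, while milk and butter have a lift below 1, which is the same kind of reading applied to the Groceries matrix above.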

However, some of the main issues with this model are as follows:

  • A high number of products leads to a huge matrix, which is difficult to work with since it needs more time and space to process.
  • It can only detect pairs of items for recommendations. Finding combinations of more than two items is possible with this model, but that needs additional logic related to set theory.
  • It faces the cold start problem, well known in recommender engines, which occurs when a new product is launched: we cannot make recommendations for it or predict how it will sell, since our historical data does not contain any information about it.
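As an illustration of the second point, going beyond pairs means enumerating larger item combinations within each transaction, which a two-dimensional matrix cannot hold directly. The following is a minimal base R sketch that counts three-item combinations; the transactions are hypothetical toy data and the approach is plain enumeration, not an efficient frequent itemset algorithm.

```r
# Hypothetical toy transactions (NOT from any dataset in this section)
transactions <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread", "butter", "beer"),
  c("bread", "butter"),
  c("milk", "bread")
)

# Enumerate every 3-item combination inside each transaction
triples <- unlist(lapply(transactions, function(tr) {
  items <- sort(unique(tr))
  if (length(items) < 3) return(character(0))   # too small for a triple
  combn(items, 3, FUN = paste, collapse = " + ")
}))

# Count how often each triple occurs across all transactions
triple.counts <- sort(table(triples), decreasing = TRUE)
triple.counts
```

The number of combinations grows quickly with basket size and product count, which is one reason dedicated frequent itemset methods are preferred for larger itemsets.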