Frequent itemset generation

We will now look at a better technique for finding patterns and detecting frequently bought products: frequent itemset generation. We will implement this algorithm from scratch. Although we usually solve machine learning or optimization problems using readymade, optimized algorithms available out of the box in various R packages, one of the main objectives of this book is to make sure we understand exactly what goes on behind the scenes of a machine learning algorithm. Thus, we will see how we can build some of these algorithms ourselves using the principles of mathematics, statistics, and logic.

Getting started

The data we will be using for this is the shopping_transaction_log.csv dataset, which we used to explain the concepts of market basket analysis at the beginning of the chapter. The code for this section is available in the ch3_frequent itemset generation.R file. We will first go through all the helper functions and then define the main function, which uses them to build a workflow for frequent itemset generation.

We will start by loading some library dependencies and utility functions:

## load library dependencies 
library(dplyr)  # manipulating data frames
library(gridExtra)  # output clean formatted tables

## Utility function: Appends vectors to a list
list.append <- function (mylist, ...){
  mylist <- c(mylist, list(...))
  return(mylist)
}
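As a quick illustration (with made-up products), the helper appends each transaction as a separate element of a growing list, which is exactly how we will accumulate transactions when reading the dataset:

basket.list <- list()
basket.list <- list.append(basket.list, c("beer", "diapers"))
basket.list <- list.append(basket.list, c("milk", "bread"))
length(basket.list)   # 2 transactions
basket.list[[1]]      # "beer" "diapers"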

Data retrieval and transformation

Next, we will define the functions for getting the data and transforming it into the required format of a data frame consisting of products and purchase frequency. We also have a function to prune this data frame if we want to remove products below a certain purchase frequency threshold.

## Step 1: Function to read the dataset into memory from file
get_transaction_dataset <- function(filename){
  df <- read.csv(filename, header = FALSE)
  dataset <- list()
  for (index in seq(nrow(df))){
    transaction.set <- as.vector(unlist(df[index,]))
    transaction.set <- transaction.set[transaction.set != ""]
    dataset <- list.append(dataset, transaction.set)
  }
  return(dataset)
}  

## Step 2: Function to convert dataset into a data frame
get_item_freq_table <- function(dataset){
  item.freq.table <- unlist(dataset) %>% table %>% data.frame
  return (item.freq.table)
}

## Step 3: Function to prune items based on minimum frequency
##         as specified by the user.
##         Here min freq <- item.min.freq
prune_item_freq_table <- function(item.freq.table, item.min.freq){
  pruned.item.table <- item.freq.table[item.freq.table$Freq >= 
                                       item.min.freq,]
  return (pruned.item.table)
}
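To make the shape of this frequency table concrete, here is a quick run on a small in-memory dataset (the transactions and counts are made up for illustration and are not taken from the CSV file):

toy.dataset <- list(c("beer", "diapers"),
                    c("beer", "diapers", "milk"),
                    c("bread", "butter"))
toy.freq.table <- get_item_freq_table(toy.dataset)
toy.freq.table
## one row per product with columns '.' (item) and 'Freq' (purchase count),
## for example beer = 2, diapers = 2, milk = 1, bread = 1, butter = 1
prune_item_freq_table(toy.freq.table, item.min.freq = 2)
## keeps only beer and diapers, which appear at least twice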

Building an itemset association matrix

Now, we will implement three functions to help us build the itemset association matrix. The first function returns the different unique itemset combinations from the list of items in our transactional dataset, where the number of items per itemset is passed as a parameter. This helps us get itemsets of a particular size.

## Step 4: Function to get possible itemset combinations where 
##         each itemset has n number of items where n is specified 
##         by the user. Here n <- num.items 
get_associated_itemset_combinations <- function(pruned.item.table, 
                                                num.items){
  itemset.associations <- c()
  itemset.association.matrix <- combn(pruned.item.table$., 
                                      num.items)
  for (index in seq(ncol(itemset.association.matrix))){
    itemset.associations <- c(itemset.associations,
                        paste(itemset.association.matrix[,index],
                                    collapse = ", ")
                            )
  }
  itemset.associations <- unique(itemset.associations)
  return (itemset.associations)
}
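Here is a quick illustration with a hand-built pruned table (the items and counts are made up; stringsAsFactors = FALSE just keeps the item column as plain character strings):

toy.pruned <- data.frame(. = c("beer", "diapers", "milk"),
                         Freq = c(6, 6, 4),
                         stringsAsFactors = FALSE)
get_associated_itemset_combinations(toy.pruned, num.items = 2)
## [1] "beer, diapers" "beer, milk"    "diapers, milk"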

The following function builds a frequency contingency table showing the occurrence of each itemset in each transaction of the dataset. This forms the basis for building our frequent itemsets. The itemset association matrix shows, at a high level, how often each of the unique itemsets generated by the previous function occurs in each transaction of our dataset.

## Step 5: Function to build an itemset association matrix where 
##         we see a contingency table showing itemset association 
##         occurrence in each transaction of the dataset
build_itemset_association_matrix <- function(dataset,   
                                      itemset.association.labels,
                                      itemset.combination.nums){  
  itemset.transaction.labels <- sapply(dataset, paste, 
                                       collapse=", ")
  itemset.associations <- lapply(itemset.association.labels, 
                              function(itemset){
                                unlist(strsplit(itemset, ", ", 
                                                fixed = TRUE)
                                       )
                              }
                          )
  # building the itemset association matrix
  association.vector <- c()
  for (itemset.association in itemset.associations){
    association.vector <- c(association.vector,
           unlist(
             lapply(dataset, 
                    function(dataitem,  
                             num.items=itemset.combination.nums){ 
                      m <- match(dataitem, itemset.association)
                      m <- length(m[!is.na(m)])
                      if (m == num.items){
                        1
                      }else{
                        NA
                      }
                    }
             )
           )
    )
  }
  
  itemset.association.matrix <- matrix(association.vector, 
                                       nrow = length(dataset))
  itemset.association.labels <- sapply(itemset.association.labels, 
                                       function(item) {
                                         paste0('{', paste(item, 
                                           collapse = ', '), '}')
                                       }
                                )  

  itemset.transaction.labels <- sapply(dataset, 
                                    function(itemset){
                                      paste0('{', paste(itemset, 
                                          collapse = ', '), '}')
                                    }
                                )
  colnames(itemset.association.matrix) <- itemset.association.labels
  rownames(itemset.association.matrix) <- itemset.transaction.labels
  
  return (itemset.association.matrix)
}
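To see what this matrix looks like, here is an illustrative run on a tiny made-up dataset with two candidate itemsets (a 1 marks a transaction that contains the itemset, NA marks one that does not):

toy.dataset <- list(c("beer", "diapers"),
                    c("beer", "milk"),
                    c("beer", "diapers", "milk"))
build_itemset_association_matrix(toy.dataset,
                     itemset.association.labels = c("beer, diapers",
                                                    "beer, milk"),
                     itemset.combination.nums = 2)
##                       {beer, diapers} {beer, milk}
## {beer, diapers}                     1           NA
## {beer, milk}                       NA            1
## {beer, diapers, milk}               1            1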

Once we have the itemset association matrix, we use it in the following function to sum up these individual itemset occurrences and obtain the total occurrence of each itemset in the whole dataset:

## Step 6: Function to generate total occurrences of each itemset 
##         in the transactional dataset based on data from the 
##         association matrix
get_frequent_itemset_details <- function(itemset.association.matrix){
  frequent.itemsets.table <- apply(itemset.association.matrix, 
                                   2, sum, na.rm=TRUE)
  return (frequent.itemsets.table)
}
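Continuing the made-up example from the previous step, summing each column of that matrix (ignoring the NA entries) gives the total count of every candidate itemset across all transactions:

toy.dataset <- list(c("beer", "diapers"),
                    c("beer", "milk"),
                    c("beer", "diapers", "milk"))
toy.matrix <- build_itemset_association_matrix(toy.dataset,
                              c("beer, diapers", "beer, milk"), 2)
get_frequent_itemset_details(toy.matrix)
## {beer, diapers}    {beer, milk} 
##               2               2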

Creating a frequent itemsets generation workflow

Finally, we will define the function that uses all the previous functions to create a workflow for generating the frequent itemsets. The main parameters here are data.file.path, which contains the location of the dataset; itemset.combination.nums, which denotes the number of items in each itemset; item.min.freq, which denotes the minimum purchase count threshold for each item; and minsup, which specifies the minimum support required for the generated frequent itemsets.

## Step 7: Function containing entire workflow to generate 
##         frequent itemsets
frequent.itemsets.generator <- function(data.file.path, 
                                     itemset.combination.nums=2, 
                                     item.min.freq=2, minsup=0.2){
  # get the dataset
  dataset <- get_transaction_dataset(data.file.path)
  
  # convert data into item frequency table
  item.freq.table <- get_item_freq_table(dataset)
  pruned.item.table <- prune_item_freq_table(item.freq.table, 
                                             item.min.freq)
  
  # get itemset associations
  itemset.association.labels <- get_associated_itemset_combinations(pruned.item.table,
                                   itemset.combination.nums)
  itemset.association.matrix <- build_itemset_association_matrix(dataset, 
                                itemset.association.labels, 
                                itemset.combination.nums)
  
  # generate frequent itemsets
  frequent.itemsets.table <- get_frequent_itemset_details(itemset.association.matrix)
  frequent.itemsets.table <- sort(frequent.itemsets.table[frequent.itemsets.table > 0], 
                                  decreasing = TRUE)
  
  frequent.itemsets.names <- names(frequent.itemsets.table)
  frequent.itemsets.frequencies <- as.vector(frequent.itemsets.table)
  frequent.itemsets.support <- round((frequent.itemsets.frequencies * 100) / length(dataset), 
                                     digits=2)
  
  frequent.itemsets <- data.frame(Itemset=frequent.itemsets.names,
                          Frequency=frequent.itemsets.frequencies,
                          Support=frequent.itemsets.support)
  # apply minimum support cutoff to get frequent itemsets
  minsup.percentage <- minsup * 100
  frequent.itemsets <- subset(frequent.itemsets, 
                              Support >= minsup.percentage)
  frequent.itemsets$Support <- sapply(frequent.itemsets$Support,
                                      function(value){
                                        paste0(value, '%')
                                      }
                               )
  
  # printing to console
  cat("
Item Association Matrix
")
  print(itemset.association.matrix)
  cat("

")
  cat("
Valid Frequent Itemsets with Frequency and Support
")
  print(frequent.itemsets)
  
  # displaying frequent itemsets as a pretty table
  if (names(dev.cur()) != "null device"){
    dev.off()
  }
  grid.table(frequent.itemsets)
}

Detecting shopping trends

Now it's time to test our algorithm! We will first generate all the frequent itemsets that have two items where each item has been purchased at least three times in the overall dataset and have a minimum support of at least 20%. To do this, you will have to fire up the following function in the R console. Do remember to load all the previous functions in memory first.

> frequent.itemsets.generator(
         data.file.path='shopping_transaction_log.csv',    
         itemset.combination.nums=2, item.min.freq=3, minsup=0.2)

We get the following itemset contingency matrix, which is used to generate the frequent itemsets. The left side rows indicate the transactions and each column represents an itemset.

[Figure: the itemset association (contingency) matrix printed to the console]

The final frequent itemsets will be shown both in the console and in the plot section in the form of a pretty table, as follows:

[Figure: the frequent itemsets with their frequency and support values]

You can clearly see that the itemset {beer, diapers} is our most frequent itemset, with a support of approximately 67%; it occurs six times in total in our dataset, and the association matrix shows the exact transactions in which it occurs. The function thus detects a trend of people buying beer and diapers, or diapers and milk, more frequently, so we can recommend these products to shoppers who pick up one of them.
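As a quick check of that support value (the figures imply nine transactions in the log, which we assume here), the support percentage is simply the itemset frequency divided by the total number of transactions, exactly as computed in the workflow function:

## support(%) = itemset frequency / total number of transactions * 100
round((6 * 100) / 9, digits = 2)
## [1] 66.67

We will also take a look at the frequent itemsets containing three items next: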

> frequent.itemsets.generator(
         data.file.path='shopping_transaction_log.csv',
         itemset.combination.nums=3, item.min.freq=1, minsup=0.2)

This gives us the following table showing the frequent itemsets with their necessary statistics:

[Figure: the frequent itemsets of three items with their frequency and support values]

Thus, we see that we get two frequent itemsets with support greater than 20%. Of course, remember that this is a small dataset; the larger your dataset of purchase transactions, the more patterns you will find, and with stronger support.

We have successfully built an algorithm for generating frequent itemsets! You can use the same algorithm on new datasets to generate more frequent itemsets, and then start recommending products as soon as you see a customer buying one or more items from any of the frequent itemsets. A simple example: if we see people buying beer, we can recommend diapers and milk to them, since that shopping trend was detected by our algorithm in the frequent itemsets earlier.
