Data preparation

Our data preparation task involves taking the transactions data and converting it to a form where we have product pairs and their transaction frequency. Transaction frequency is the number of transactions in which both the products have appeared. We will use these product pairs to build our graph. The vertices of our graph are the products. For every product pair, an edge is drawn in the graph between the corresponding product vertices. The weight of the edge is the transaction frequency.
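Before reaching for arules, the pair-counting idea itself can be sketched in a few lines of base R (toy baskets, not the retailer's data): enumerate the product pairs in every transaction and tally the number of transactions in which each pair appears.

```r
# Toy transactions; each element is one basket of products.
transactions <- list(
  c("milk", "bread", "eggs"),
  c("milk", "bread"),
  c("bread", "eggs")
)

# For each basket, list its product pairs as "a,b" strings
# (sorted so the same pair always produces the same key).
all.pairs <- unlist(lapply(transactions, function(basket) {
  items <- sort(unique(basket))
  if (length(items) < 2) return(character(0))
  combn(items, 2, FUN = paste, collapse = ",")
}))

# Transaction frequency of every pair: the edge weights of our graph.
pair.freq <- table(all.pairs)
pair.freq
# bread,eggs bread,milk  eggs,milk
#          2          2          1
```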

We will use the arules package version 1.5-0 to help us perform this data preparation task:

> library(arules)
> transactions.obj <- read.transactions(file = 'data.csv', format = "single",
+ sep = ",",
+ cols = c("order_id", "product_id"),
+ rm.duplicates = FALSE,
+ quote = "", skip = 0,
+ encoding = "unknown")
Warning message:
In asMethod(object) : removing duplicated items in transactions
> transactions.obj
transactions in sparse format with
6988 transactions (rows) and
16793 items (columns)

We begin by reading our transactions from the text file and creating an arules data structure called transactions. Let's look at the parameters of read.transactions, the function used to create the transactions object. The first parameter, file, takes the path to the text file containing the retailer's transactions. The second parameter, format, can take either of two values, single or basket, depending on how the input data is organized. In our case, we have a tabular format with two columns: one column holding the unique identifier of the transaction and the other a unique identifier for a product present in that transaction. This format is named single by arules. Refer to the arules documentation for a detailed description of all the parameters. On inspecting the newly created transactions object, transactions.obj, we see that there are 6,988 transactions and 16,793 products.
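To make the single format concrete, here is a small sketch with hypothetical file contents: each row is one (order_id, product_id) pair, so a three-product transaction occupies three rows, and grouping rows by order_id recovers the baskets.

```r
# Hypothetical contents of a "single"-format file: one row per
# (order_id, product_id) pair.
csv.text <- "order_id,product_id
1,Banana
1,Strawberries
2,Banana
2,Organic Whole Milk
2,Cucumber Kirby"

raw <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# Grouping by order_id recovers the transactions, conceptually what
# read.transactions(format = "single", cols = c("order_id", "product_id"))
# does before building its sparse item matrix.
baskets <- split(raw$product_id, raw$order_id)

length(baskets)                # 2 transactions
length(unique(raw$product_id)) # 4 distinct items
```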

Now that we have loaded the transaction data, let's proceed to find the product pairs:

> support    <- 0.015
>
> # Frequent item sets
> parameters = list(
+ support = support,
+ minlen = 2, # Minimal number of items per item set
+ maxlen = 2, # Maximal number of items per item set
+ target = "frequent itemsets"
+ )
>
> freq.items <- apriori(transactions.obj, parameter = parameters)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen            target   ext
         NA    0.1    1 none FALSE            TRUE       5   0.015      2      2 frequent itemsets FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 104

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[16793 item(s), 6988 transaction(s)] done [0.02s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [25 set(s)] done [0.00s].
creating S4 object ... done [0.00s].
Warning message:
In apriori(transactions.obj, parameter = parameters) :
Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!
>
"Apriori is an algorithm for frequent itemset mining and association over transaction databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those itemsets appear sufficiently often in the database."  -- Wikipedia

Generating frequent itemsets is the first phase of the apriori algorithm. We conveniently leverage this phase of the algorithm to generate our product pairs and the number of transactions in which they are present.

To understand more about apriori, refer to Chapter 1, Association Rule Mining, of this book. Let us do a quick recap here.

The apriori algorithm works in two phases. Finding the frequent item sets is the first phase of the association rule mining algorithm. A group of product IDs is called an item set. The algorithm makes multiple passes over the database; in the first pass, it finds the transaction frequency of all the individual items. These are item sets of order 1. Let's introduce the first interest measure, support, here:

  • Support: Support is a parameter we pass to this algorithm, a value between 0 and 1. Let's say we set the value to 0.1. We then consider an item set frequent, and carry it into the subsequent phases, if and only if it appears in at least 10% of the transactions.

Now in the first pass, the algorithm calculates the transaction frequency for each product. At this stage, we have order 1 item sets. We discard all item sets that fall below our support threshold. The assumption here is that items with high transaction frequency are more interesting than those with very low frequency; items with very low support are not going to make interesting rules further down the pipeline. Using the most frequent items, we construct item sets with two products and find their transaction frequency, that is, the number of transactions in which both items are present. Once again, we discard all two-product item sets, also known as item sets of order 2, that fall below the given support threshold. We continue this way until no higher-order item sets can be built. Let's look at a quick illustration:

Pass 1:

Support = 0.1
Item set, support
{item5}, 0.4
{item6}, 0.3
{item9}, 0.2
{item11}, 0.05

item11 will be discarded in this phase as its support is below the threshold.

Pass 2:

{item5, item6}
{item5, item9}
{item6, item9}

As you can see, we have constructed item sets of order 2 using the filtered items from pass 1. We proceed to find their transaction frequency, discard the item sets falling below our minimum support threshold, and step into pass 3, where we create item sets of order 3, calculate their transaction frequency, and filter once more before moving to pass 4. In one of the subsequent passes, we will reach a stage where no higher-order item sets can be created; that is when we stop.
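The two passes above can be sketched in base R. The toy support values follow the illustration; the names frequent.items and candidates are invented here for the sketch:

```r
support.threshold <- 0.1

# Pass 1: support of the order-1 item sets; anything below
# the threshold is dropped.
item.support <- c(item5 = 0.4, item6 = 0.3, item9 = 0.2, item11 = 0.05)
frequent.items <- names(item.support)[item.support >= support.threshold]
frequent.items
# "item5" "item6" "item9"   (item11 is discarded)

# Pass 2: order-2 candidates are built only from the pass-1 survivors.
candidates <- combn(frequent.items, 2, FUN = paste, collapse = ",")
candidates
# "item5,item6" "item5,item9" "item6,item9"
```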

The apriori method in the arules package is used to get the frequent item sets. This method takes two parameters: the transactions object, and a named list of mining parameters. We create a named list called parameters. Inside it, we have an entry for our support threshold, which we have set to 0.015. The minlen and maxlen entries set a lower and upper cutoff on the number of items we expect in our item sets. By setting both minlen and maxlen to 2, we say that we want only product pairs.

The apriori method returns an itemset object. Let's now extract the product pairs and their transaction frequency from the itemset object, freq.items:

> freq.items.df <- data.frame(item_set = labels(freq.items)
+ , support = freq.items@quality)
> freq.items.df$item_set <- as.character(freq.items.df$item_set)
> head(freq.items.df)
item_set support.support support.count
1 {Banana,Honeycrisp Apple} 0.01617058 113
2 {Banana,Organic Fuji Apple} 0.01817401 127
3 {Banana,Cucumber Kirby} 0.01788781 125
4 {Banana,Strawberries} 0.01931883 135
5 {Bag of Organic Bananas,Organic Zucchini} 0.01659989 116
6 {Organic Strawberries,Organic Whole Milk} 0.01617058 113

From the itemset object, freq.items, returned by apriori, we extract our product pairs and their transaction frequency count. The item_set column in our dataframe, freq.items.df, holds the product pair, the support.count column is the actual number of transactions in which both products were present, and the support.support column gives the support.count value as a fraction of the total number of transactions. Notice that the product pairs are enclosed in curly braces and share a single column; we need them in two separate columns.

Let's write some cleaning code to remove those braces and also separate our product pairs into two different columns:

> library(tidyr)
> freq.items.df <- separate(data = freq.items.df, col = item_set, into = c("item.1", "item.2"), sep = ",")
> freq.items.df[] <- lapply(freq.items.df, gsub, pattern='\\{', replacement='')
> freq.items.df[] <- lapply(freq.items.df, gsub, pattern='\\}', replacement='')
> head(freq.items.df)
item.1 item.2 support.support support.count
1 Banana Honeycrisp Apple 0.0161705781339439 113
2 Banana Organic Fuji Apple 0.0181740125930166 127
3 Banana Cucumber Kirby 0.0178878076702919 125
4 Banana Strawberries 0.0193188322839153 135
5 Bag of Organic Bananas Organic Zucchini 0.0165998855180309 116
6 Organic Strawberries Organic Whole Milk 0.0161705781339439 113

We leverage the separate function from the tidyr package to split the item_set column into two columns. As the products are separated by a comma, we pass the comma as the sep argument to separate. Once separated, we run a regular expression over those columns to remove the curly braces; the braces are escaped as \\{ and \\} because they are regex metacharacters. Note that applying gsub via lapply converts every column, including the support columns, to character, which is why the support values print as full-precision strings in the output above.

Let us now create a new data frame with product pairs and weights:

> network.data <- freq.items.df[,c('item.1','item.2','support.count')]
> names(network.data) <- c("from","to","weight")
> head(network.data)
from to weight
1 Banana Honeycrisp Apple 113
2 Banana Organic Fuji Apple 127
3 Banana Cucumber Kirby 125
4 Banana Strawberries 135
5 Bag of Organic Bananas Organic Zucchini 116
6 Organic Strawberries Organic Whole Milk 113

We retain only the item.1, item.2, and support.count columns. Next, we rename these columns to from, to, and weight. The igraph package expects this naming convention to create a graph object seamlessly. Finally, you can see that we have shaped the data to suit the igraph package's graph manipulation functions.

We leveraged the apriori function in arules to prepare our dataset. Equipped with our dataset, let's proceed to perform product network analysis to discover micro-categories.
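As a preview of why the from/to/weight naming matters, here is a short sketch (assuming the igraph package is installed): graph_from_data_frame treats the first two columns of a data frame as the edge list and keeps every remaining column as an edge attribute, so a column named weight automatically makes the graph weighted.

```r
library(igraph)

# A tiny hand-written slice in the shape of network.data, for illustration.
edges <- data.frame(
  from   = c("Banana", "Banana"),
  to     = c("Strawberries", "Cucumber Kirby"),
  weight = c(135, 125),
  stringsAsFactors = FALSE
)

# First two columns become the edge list; "weight" becomes E(g)$weight.
g <- graph_from_data_frame(edges, directed = FALSE)

is_weighted(g)   # TRUE
E(g)$weight      # 135 125
```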
