Product network analysis

There are two steps in product network analysis. The first step is to transform the point-of-sale data into product pairs and their transaction frequency. The second step is to create a graph using data from the first step and run a clustering algorithm on the graph. The subgraphs or the clusters formed are presented as the micro-categories. Also, some products in the graph play key roles. Clustering and visualizing these product subgraphs will also help us identify those key products. According to the white paper by Corte Consultancy, a product fitting any of the following definitions is considered as key to the network:

The core product: In a subgraph or a cluster group, the product that is most commonly purchased in the group is termed as the core product of that group.
The connectors: These are products that connect two subgraphs or clusters together. They are the ones that are typically bought first, if a customer starts shopping for products in that subgraph. Promoting this product as a part of cross-selling can influence customers who have never bought any products from this subgraph of products to start purchasing products present in this subgraph.

With our network.data data frame prepared in the exact manner we need, let's proceed to build a graph of products:

> set.seed(29)
> my.graph <- graph_from_data_frame(network.data)
> plot.igraph(my.graph,
+             layout=layout.fruchterman.reingold,
+             vertex.label.cex=.5,
+             edge.arrow.size=.1)
>

The preceding plot visualizes our graph.

Having built the graph, let's proceed to perform the clustering exercise, the second step in product network analysis. The clustering of graphs is performed by community detection algorithms. Communities discovered by the graphs are considered as clusters. Though there are some small nuances in equating communities to clusters, in this chapter, we will stick to the simple principle that the communities we discover in our graphs are going to be the clusters we want to consider for our product network analysis.

Refer to the paper, Detecting Community Structure in Networks, by M. E. J. Newman, for in-depth analysis of clusters and communities.

Community in a graph: In a graph, community is a subset of vertices that are completely connected to each other. The correct technical term for it is clique. It is, ideally, groups of vertices within which connections are dense, but between which connections are sparser. It sounds very similar to the cluster cohesiveness requirement.

Community detection algorithms: These are a group of algorithms that, when applied on a graph, produce the most likely communities present in the graph. Traditional algorithms, such as hierarchical clustering, use similarity-based approaches to arrive at the communities. The more recent algorithms leverage several structural properties of the graph.

Walktrap algorithm: This algorithm is based on the random walk principle. The fundamental assumption with the random walk algorithm is as follows. A community typically has very few edges leading outside of the community. So if we perform a random walk from a vertex, we are more likely to stay within the community. We will use this algorithm to detect communities for our use case.

Let's go ahead and use the walktrap algorithm to discover communities/clusters:

> random.cluster <- walktrap.community(my.graph)

> random.cluster
IGRAPH clustering walktrap, groups: 2, mod: 0.26
+ groups:
  $`1`
  [1] "Banana"             "Large Lemon"        "Organic Avocado"    "Honeycrisp Apple"  
  [5] "Organic Fuji Apple" "Cucumber Kirby"     "Strawberries"       "Organic Whole Milk"
  [9] "Limes"             
  
  $`2`
  [1] "Bag of Organic Bananas" "Organic Strawberries"   "Organic Hass Avocado"   "Organic Raspberries"   
  [5] "Organic Baby Spinach"   "Organic Zucchini"      
  
> groupings.df <- data.frame(products = random.cluster$names, group = random.cluster$membership)
> head(groupings.df)
                products group
1                 Banana     1
2 Bag of Organic Bananas     2
3   Organic Strawberries     2
4   Organic Hass Avocado     2
5    Organic Raspberries     2
6            Large Lemon     1
> groupings.df[groupings.df$group == 2,]
                 products group
2  Bag of Organic Bananas     2
3    Organic Strawberries     2
4    Organic Hass Avocado     2
5     Organic Raspberries     2
8    Organic Baby Spinach     2
13       Organic Zucchini     2
> groupings.df[groupings.df$group == 1,]
             products group
1              Banana     1
6         Large Lemon     1
7     Organic Avocado     1
9    Honeycrisp Apple     1
10 Organic Fuji Apple     1
11     Cucumber Kirby     1
12       Strawberries     1
14 Organic Whole Milk     1
15              Limes     1

We pass our graph to the walktrap.community function. We then move the cluster membership and cluster members to a data frame, groupings.df. Our random walk algorithm produced two subgraphs or communities.

Let's proceed to plot these clusters:

> plot(random.cluster,my.graph,
+             layout=layout.fruchterman.reingold,
+             vertex.label.cex=.5,
+             edge.arrow.size=.1)

Let's start with the large cluster in the left side of the picture. It resembles a star schema, with Banana occupying the center. Let's confirm this by looking at some properties of the Banana vertex.

Let's look at the adjacency matrix of this graph:

> get.adjacency(my.graph)
15 x 15 sparse Matrix of class "dgCMatrix"
   [[ suppressing 15 column names 'Banana', 'Bag of Organic Bananas', 'Organic Strawberries' ... ]]
                                                    
Banana                 . . 1 1 . 1 1 1 1 1 1 1 . 1 1
Bag of Organic Bananas . . 1 1 1 . . 1 . . . . 1 . .
Organic Strawberries   . . . . . . . . . . . . . 1 .
Organic Hass Avocado   . . 1 . 1 . . . . . . . . . .
Organic Raspberries    . . 1 . . . . . . . . . . . .
Large Lemon            . . . . . . . . . . . . . . 1
Organic Avocado        . . 1 . . . . 1 . . . . . . .
Organic Baby Spinach   . . 1 1 . . . . . . . . . . .
Honeycrisp Apple       . . . . . . . . . . . . . . .
Organic Fuji Apple     . . . . . . . . . . . . . . .
Cucumber Kirby         . . . . . . . . . . . . . . .
Strawberries           . . . . . . . . . . . . . . .
Organic Zucchini       . . . . . . . . . . . . . . .
Organic Whole Milk     . . . . . . . . . . . . . . .
Limes                  . . . . . . . . . . . . . . .
>

If we quickly look at the adjacency matrix, it appears that Banana is the Core Product for this cluster. Banana is connected to most of the other products.

Let's look at the strength and degree for the Banana vertex:

> strength(my.graph)
                Banana Bag of Organic Bananas   Organic Strawberries   Organic Hass Avocado 
                  1697                    941                   1191                    789 
   Organic Raspberries            Large Lemon        Organic Avocado   Organic Baby Spinach 
                   414                    288                    422                    845 
      Honeycrisp Apple     Organic Fuji Apple         Cucumber Kirby           Strawberries 
                   113                    127                    125                    135 
      Organic Zucchini     Organic Whole Milk                  Limes 
                   116                    233                    270 
> degree(my.graph)
                Banana Bag of Organic Bananas   Organic Strawberries   Organic Hass Avocado 
                    11                      5                      7                      5 
   Organic Raspberries            Large Lemon        Organic Avocado   Organic Baby Spinach 
                     3                      2                      3                      5 
      Honeycrisp Apple     Organic Fuji Apple         Cucumber Kirby           Strawberries 
                     1                      1                      1                      1 
      Organic Zucchini     Organic Whole Milk                  Limes 
                     1                      2                      2 
>

Banana has very high degree and strength, and thus there is more evidence to assign Banana as the Core Product for this cluster. Though it was evident that Banana is the most important product in the cluster from the graph plot, it gives us more confidence to look at some of the properties of the Banana vertex.

Having decided upon the Core Product as banana for the left cluster, let's do some random walking from the Banana vertex. We select an edge uniformly and randomly from all the out-degrees of Banana and do a specified number of steps. To find out more about random walk click http://igraph.org/r/doc/random_walk.html.

> for(var in seq(1,5)){
+   print (random_walk(my.graph, start = "Banana", steps = 2))
+ }
+ 2/15 vertices, named, from 16e82b7:
[1] Banana               Organic Strawberries
+ 2/15 vertices, named, from 16e82b7:
[1] Banana          Organic Avocado
+ 2/15 vertices, named, from 16e82b7:
[1] Banana         Cucumber Kirby
+ 2/15 vertices, named, from 16e82b7:
[1] Banana Limes 
+ 2/15 vertices, named, from 16e82b7:
[1] Banana       Strawberries
> random_walk(my.graph, start = "Banana", steps = 3)
+ 3/15 vertices, named, from 16e82b7:
[1] Banana               Organic Baby Spinach Organic Hass Avocado
> random_walk(my.graph, start = "Banana", steps = 4)
+ 4/15 vertices, named, from 16e82b7:
[1] Banana               Organic Avocado      Organic Baby Spinach Organic Strawberries
> random_walk(my.graph, start = "Banana", steps = 5)
+ 4/15 vertices, named, from 16e82b7:
[1] Banana               Organic Avocado      Organic Strawberries Organic Whole Milk  
>

This random walk can expose some of the product affinities quickly without a lot of computation. Imagine working on large graphs with thousands of vertices; these random walks performed multiple times can quickly unearth product affinities.

Another data source that can be used to experiment with the techniques described in this chapter is Amazon's online retail page which is available for download.
https://snap.stanford.edu/data/amazon0302.html

The Organic Avacado and Banana products in the left cluster can be termed as The Connector. They have a path into the right-side cluster. Promoting these products can induce the customers of the left cluster to start exploring or buying the products in the right cluster.

This brings us to the end of this section. Here, we were able to use the random walk principle-based community detection algorithm to find communities in our graph. These communities can be the new micro-categories. Further, we also identified our core and connector vertices.

Table of Contents for Product network analysis

Create new playlist

Sign In

Sign Up

Table of Contents for
Product network analysis