Identifying the customer segments in the wholesale customer data using DIANA

Hierarchical clustering algorithms are a good choice when the clusters in the data are not necessarily circular (or hyperspherical) and when we do not know the number of clusters in advance. With hierarchical clustering algorithms, unlike flat or partitioning algorithms, there is no need to decide on and pass the number of clusters to be formed before applying the algorithm to the dataset.

Hierarchical clustering results in a dendrogram (a tree diagram) that can be inspected visually to determine the number of clusters. Visual inspection enables us to cut the dendrogram at suitable places.

The results produced by this type of clustering algorithm are reproducible: there is no random initialization involved, so repeated runs on the same data with the same distance metric and linkage always yield the same result. The trade-off is computational cost; the algorithm's complexity is at least quadratic in the number of observations, which makes it better suited to small and medium-sized datasets and, in particular, to exploring the hierarchical relationships that exist between the clusters.

Divisive hierarchical clustering, also known as DIvisive ANAlysis (DIANA), is a hierarchical clustering algorithm that follows a top-down approach to identify clusters in a given dataset. Here are the steps in DIANA to identify the clusters:

  1. All observations of the dataset are assigned to the root, so in the initial step only a single cluster is formed.
  2. In each iteration, the most heterogeneous cluster is partitioned into two.
  3. Step 2 is repeated until all the observations are in their own cluster:

Working of divisive hierarchical clustering algorithm
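To make the top-down idea concrete, here is a rough, illustrative sketch of a single divisive split on a small made-up matrix of points. It is not the implementation used by cluster::diana; the data, the splinter rule (peeling off the point with the largest average dissimilarity and then moving over points that sit closer to the splinter group), and the stopping condition are simplifications for illustration only. The criterion actually used to decide the splits is discussed next.

# illustrative sketch only: one divisive split of 10 made-up two-dimensional points
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)      # 10 points in 2 dimensions
d <- as.matrix(dist(x))               # pairwise Euclidean distances

# start with every point in cluster A and peel off the most dissimilar point
splinter <- which.max(rowMeans(d))    # seed of the new cluster B
A <- setdiff(seq_len(nrow(x)), splinter)
B <- splinter

# keep moving points that are, on average, closer to B than to the rest of A
repeat {
  if (length(A) == 1) break
  gain <- sapply(A, function(i) mean(d[i, setdiff(A, i)]) - mean(d[i, B]))
  if (all(gain <= 0)) break
  move <- A[which.max(gain)]
  B <- c(B, move)
  A <- setdiff(A, move)
}
A                                     # members of the first sub-cluster
B                                     # members of the second sub-cluster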

One obvious question that comes up is about the technique used by the algorithm to split a cluster into two. The split is performed according to some (dis)similarity measure; the Euclidean distance is typically used to measure the distance between two given points. The algorithm splits the data on the basis of the farthest-distance measure among all the pairwise distances between the data points, and linkage defines precisely how this farness between groups of points is measured. The next figure illustrates the various linkages considered by DIANA for splitting the clusters, and a small worked sketch follows the figure. Here are some of the distances used to split the groups:

  • Single-link: Nearest distance or single linkage
  • Complete-link: Farthest distance or complete linkage
  • Average-link: Average distance or average linkage
  • Centroid-link: Centroid distance
  • Ward's method: The sum of squared Euclidean distances is minimized

Take a look at the following diagram to better understand the preceding distances:

Illustration depicting various linkage types used by DIANA
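To make these linkage notions concrete, here is a small illustrative sketch (the two groups of points are made up for the example) that computes the single, complete, average, and centroid linkage distances between two groups using base R:

# two made-up groups of two-dimensional points
g1 <- matrix(c(1, 1,
               2, 1), ncol = 2, byrow = TRUE)
g2 <- matrix(c(5, 4,
               6, 5), ncol = 2, byrow = TRUE)
# all pairwise Euclidean distances between the points of the two groups
dmat <- as.matrix(dist(rbind(g1, g2)))
between <- dmat[1:2, 3:4]
min(between)                                  # single linkage: nearest pair
max(between)                                  # complete linkage: farthest pair
mean(between)                                 # average linkage: mean over all pairs
sqrt(sum((colMeans(g1) - colMeans(g2))^2))    # centroid linkage: distance between centroids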

Generally, the linkage type to be used is passed as a parameter to the clustering algorithm. The cluster package offers the diana() function to perform divisive clustering. Let's apply it to our wholesale dataset with the following code:

# setting the working directory to the folder where the dataset is located
setwd('/home/sunil/Desktop/chapter18/')
# reading the dataset into the cust_data dataframe
cust_data <- read.csv(file = 'Wholesale_customers_data.csv', header = TRUE)
# removing the non-required columns (the channel and region identifiers)
cust_data <- cust_data[, c(-1, -2)]
# loading the cluster library so as to make use of the diana function
library(cluster)
# computing divisive hierarchical clustering with diana()
cust_data_diana <- diana(cust_data, metric = "euclidean", stand = FALSE)
# plotting the dendrogram from the diana output
pltree(cust_data_diana, cex = 0.6, hang = -1,
       main = "Dendrogram of diana")
# divisive coefficient; amount of clustering structure found
print(cust_data_diana$dc)

This will give us the following output:

> print(cust_data_diana$dc)
[1] 0.9633628

The divisive coefficient (dc) measures the amount of clustering structure found; a value close to 1, as obtained here, indicates a strong clustering structure. The pltree() call produces the dendrogram of the DIANA output shown in the following figure:

The plot.hclust() and plot.dendrogram() functions can also be used on the DIANA clustering output. plot.dendrogram() yields a dendrogram that follows the natural structure of the splits made by the DIANA algorithm. Use the following code to generate the dendrogram:

plot(as.dendrogram(cust_data_diana), cex = 0.6, horiz = TRUE)

This will give the following output:

In the dendrogram output, each leaf on the right corresponds to one observation in the dataset. As we traverse from right to left, observations that are similar to each other are grouped into branches, which are themselves fused at a higher level.

The level of the fusion, read off the horizontal axis, indicates the dissimilarity between two observations: the higher the level at which two observations are first fused, the less similar they are. Note that conclusions about the proximity of two observations can be drawn only from the level at which the branches containing those two observations are first fused. In order to identify clusters, we can cut the dendrogram at a certain level, and the level at which the cut is made defines the number of clusters obtained.

We can make use of the cutree() function to obtain the cluster assignment for each of the observations in our dataset. Execute the following code to obtain the clusters and review the clustering output:

# obtain the clusters through cutree()
# Cut tree into 3 groups
grp <- cutree(cust_data_diana, k = 3)
# Number of members in each cluster
table(grp)
# Get the observations of cluster 1
rownames(cust_data)[grp == 1]

This will give the following output:

> table(grp)
grp
1 2 3
364 44 32
> rownames(cust_data)[grp == 1]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "11" "12" "13" "14" "15" "16"

"17" "18" "19" 
[19] "20" "21" "22" "25" "26" "27" "28" "31" "32" "33" "34" "35" "36" "37" "38" "41" "42" "43"
[37] "45" "49" "51" "52" "54" "55" "56" "58" "59" "60" "61" "63" "64" "65" "67" "68" "69" "70"
[55] "71" "72" "73" "74" "75" "76" "77" "79" "80" "81" "82" "83" "84" "85" "89" "90" "91" "92"
[73] "94" "95" "96" "97" "98" "99" "100" "101" "102" "103" "105" "106" "107" "108" "109" "111" "112" "113"
[91] "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "127" "128" "129" "131" "132" "133" "134"
[109] "135" "136" "137" "138" "139" "140" "141" "142" "144" "145" "147" "148" "149" "151" "152" "153" "154" "155"
[127] "157" "158" "159" "160" "161" "162" "163" "165" "167" "168" "169" "170" "171" "173" "175" "176" "178" "179"
[145] "180" "181" "183" "185" "186" "187" "188" "189" "190" "191" "192" "193" "194" "195" "196" "198" "199" "200"
[163] "203" "204" "205" "207" "208" "209" "211" "213" "214" "215" "216" "218" "219" "220" "221" "222" "223" "224"
[181] "225" "226" "227" "228" "229" "230" "231" "232" "233" "234" "235" "236" "237" "238" "239" "241" "242" "243"
[199] "244" "245" "246" "247" "248" "249" "250" "251" "253" "254" "255" "257" "258" "261" "262" "263" "264" "265"
[217] "266" "268" "269" "270" "271" "272" "273" "275" "276" "277" "278" "279" "280" "281" "282" "284" "287" "288"
[235] "289" "291" "292" "293" "294" "295" "296" "297" "298" "299" "300" "301" "303" "304" "306" "308" "309" "311"
[253] "312" "314" "315" "316" "317" "318" "319" "321" "322" "323" "324" "325" "327" "328" "329" "330" "331" "333"
[271] "335" "336" "337" "338" "339" "340" "341" "342" "343" "345" "346" "347" "348" "349" "351" "353" "355" "356"
[289] "357" "358" "359" "360" "361" "362" "363" "364" "365" "366" "367" "368" "369" "370" "372" "373" "374" "375"
[307] "376" "377" "379" "380" "381" "382" "384" "385" "386" "387" "388" "389" "390" "391" "392" "393" "394" "395"
[325] "396" "397" "398" "399" "400" "401" "402" "403" "404" "405" "406" "407" "409" "410" "411" "412" "413" "414"
[343] "415" "416" "417" "418" "420" "421" "422" "423" "424" "425" "426" "427" "429" "430" "431" "432" "433" "434"
[361] "435" "436" "439" "440"
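Instead of asking for a fixed number of clusters, the tree can also be cut at a chosen height. The following is a minimal sketch; the height value used here is an arbitrary illustration, not a recommended threshold for this dataset:

# convert the diana output to an hclust object and cut at an arbitrary height
hc <- as.hclust(cust_data_diana)
grp_h <- cutree(hc, h = 15000)   # every merge above this height is severed
table(grp_h)                     # cluster sizes implied by the height cut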

We can also visualize the clustering output with the fviz_cluster() function from the factoextra package. Use the following code to get the required visualization:

library(factoextra)
fviz_cluster(list(data = cust_data, cluster = grp))

This will give you the following output:

It is also possible to color-code the clusters within the dendrogram itself. This can be accomplished with the following code:

# plot the DIANA output as a standard hclust dendrogram
plot(as.hclust(cust_data_diana))
# draw colored rectangles around the clusters obtained by cutting the tree into 4 groups
rect.hclust(cust_data_diana, k = 4, border = 2:5)

This will give the following output:

Now that the clusters are identified, the steps we discussed earlier to evaluate cluster quality (through the silhouette index) apply here as well. As we have already covered this topic under the k-means clustering algorithm, we are not going to repeat the steps here; the code and the interpretation of the output remain the same as what was discussed under k-means.
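For reference, a minimal sketch of that silhouette check applied to the DIANA grouping (grp) obtained above could look as follows; the interpretation of the widths is identical to the k-means discussion:

library(cluster)      # silhouette()
library(factoextra)   # fviz_silhouette()
# silhouette width for every observation, based on the DIANA cluster assignment
sil <- silhouette(grp, dist(cust_data))
# average silhouette width per cluster and overall
fviz_silhouette(sil)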

As discussed earlier, the cluster output is not the end point of the customer segmentation exercise we have on hand. Similar to the discussion under the k-means algorithm, we could analyze the DIANA cluster output to identify meaningful segments so as to roll out business objectives to those specifically identified segments.
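As one possible illustration of such an analysis (this profiling step is an assumption, not part of the original code), the per-cluster mean spend on each product category gives a quick profile of the segments:

# cluster-wise mean annual spend per product category
segment_profile <- aggregate(cust_data, by = list(cluster = grp), FUN = mean)
print(segment_profile)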
