Cluster Analysis

"Quickly bring me a beaker of wine, so that I may wet my mind and say something clever."
- Aristophanes, Athenian Playwright

In the earlier chapters, we focused on learning the best algorithm to predict an outcome or response, for example, customer satisfaction or home prices. In all these cases, we had y, and that y was a function of x, or y = f(x). Our data contained the actual y values, so we could train our algorithms on x accordingly. This is referred to as supervised learning. However, there are many situations where we try to learn something from our data and either do not have y or deliberately choose to ignore it. If so, we enter the world of unsupervised learning. In this world, we build and select our algorithm based on how well it addresses our business needs rather than on how accurate it is.

Why would we try to learn without supervision? First of all, unsupervised learning can help you understand and identify patterns in your data, which may be valuable. Second, you can use it to transform your data in order to improve your supervised learning techniques.

This chapter will focus on the former and the next chapter on the latter.

So, let's begin by tackling a popular and powerful technique known as cluster analysis. With cluster analysis, the goal is to partition the observations into a number of groups (k groups), such that members within a group are as similar as possible, while members of different groups are as different as possible (a minimal sketch of this idea follows the list below). There are many examples of how this can help an organization; here are just a few:

  • The creation of customer types or segments
  • The detection of high-crime areas in a geography
  • Image and facial recognition
  • Genetic sequencing and transcription
  • Petroleum and geological exploration
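
To make the idea concrete, here is a minimal sketch in R using the built-in iris data, where we deliberately ignore the Species label and let k-means group the observations on its own. The choice of three clusters and the seed value are illustrative assumptions, not a recommendation:

    # k-means on the built-in iris data, ignoring the Species label
    data(iris)
    x <- scale(iris[, 1:4])           # standardize the four measurements
    set.seed(123)                     # k-means starts from random centers
    km <- kmeans(x, centers = 3, nstart = 25)
    table(km$cluster, iris$Species)   # compare clusters to the ignored label

The cross-tabulation at the end is only a sanity check; in a genuinely unsupervised problem, there would be no label to compare against.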

There are many uses of cluster analysis, but there are also many techniques. We will focus on the two most common: hierarchical clustering and k-means clustering. Both are effective clustering methods, but they may not always be appropriate for the large and varied datasets that you may be called upon to analyze. Therefore, we will also examine partitioning around medoids (PAM), using a dissimilarity matrix based on the Gower metric as the input. Finally, we will examine a methodology I recently learned and applied, which uses random forests to transform your data; the transformed data can then be used as an input to unsupervised learning.
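
As a preview of the PAM approach, the following is a hedged sketch using the cluster package, which provides daisy() for Gower-based dissimilarities and pam() for partitioning around medoids. The small mixed-type data frame is a hypothetical placeholder, not this chapter's dataset:

    library(cluster)

    # hypothetical mixed-type data: numeric and categorical columns
    df <- data.frame(
      income = c(45, 62, 38, 75, 51),
      region = factor(c("N", "S", "N", "E", "S")),
      tenure = c(2, 10, 1, 7, 4)
    )

    d   <- daisy(df, metric = "gower")  # dissimilarity matrix for mixed data
    fit <- pam(d, k = 2, diss = TRUE)   # partition around 2 medoids
    fit$clustering                      # cluster membership per observation

Because PAM accepts any dissimilarity matrix as input, the same call pattern carries over when a different dissimilarity measure is substituted later on.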

A final comment before moving on: you may be asked whether these techniques are more art than science, given that the learning is unsupervised. I think the clear answer is: it depends. In early 2016, I presented the methods covered here at a meeting of the Indianapolis, Indiana R User Group. To a person, we agreed that it is the judgment of the analysts and the business users that makes unsupervised learning meaningful and determines whether you have, say, three versus four clusters in your final algorithm. This quote sums it up nicely:

"The major obstacle is the difficulty in evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We argue that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use."
- von Luxburg et al. (2012)

The following are the topics that we will be covering in this chapter:

  • Hierarchical clustering
  • K-means clustering
  • Gower and PAM
  • Random forests
  • Dataset background
  • Data understanding and preparation
  • Modeling