Chapter 8. Cluster Analysis

 

"Quickly bring me a beaker of wine, so that I may wet my mind and say something clever."

 
 --Aristophanes, Athenian Playwright

In the prior chapters, we focused on learning the best algorithm to predict an outcome or response, for example, a breast cancer diagnosis or the level of Prostate-Specific Antigen. In all these cases, we had a Y, and that Y was a function of X, or y = f(x). In our data, we had the actual Y values, so we could train on the Xs accordingly. This is referred to as supervised learning. However, there are many situations where we try to learn something from our data and either do not have a Y or choose to ignore it. If so, we enter the world of unsupervised learning. In this world, we build and select our algorithm based on how well it addresses our business needs rather than on how accurately it predicts an outcome.

Why would we try to learn without supervision? First, unsupervised learning can help you understand and identify patterns in your data, which may be valuable in their own right. Second, you can use it to transform your data in order to improve your supervised learning techniques. This chapter will focus on the former; the next chapter, on the latter.

So, let's begin by tackling a popular and powerful technique known as cluster analysis. With cluster analysis, the goal is to group the observations into a number of groups (k-groups), where the members in a group are as similar as possible while the members between groups are as different as possible. There are many examples of how this can help an organization; here are just a few:

  • The creation of customer types or segments
  • The detection of high-crime areas in a geography
  • Image and facial recognition
  • Genetic sequencing and transcription
  • Petroleum and geological exploration

Just as there are many uses of cluster analysis, there are also many clustering techniques. We will focus on the two most common: hierarchical clustering and k-means. Both are effective clustering methods, but they may not always be appropriate for the large and varied datasets that you may be called upon to analyze. Therefore, we will also examine Partitioning Around Medoids (PAM) using a dissimilarity matrix based on the Gower metric as the input.
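As a rough preview, and only as a sketch, the core R functions behind these three approaches are hclust() and kmeans() from base R and pam() with daisy() from the cluster package. The mtcars data frame and the choice of three clusters below are simply placeholders for your own data and business question:

    library(cluster)  # provides pam() and daisy()

    df <- scale(mtcars)  # stand-in numeric data, standardized

    hc <- hclust(dist(df), method = "ward.D2")     # hierarchical clustering
    km <- kmeans(df, centers = 3, nstart = 25)     # k-means clustering

    gower_dist <- daisy(mtcars, metric = "gower")  # Gower dissimilarity (handles mixed data types)
    pam_fit <- pam(gower_dist, k = 3, diss = TRUE) # Partitioning Around Medoids

We will walk through each of these techniques in turn.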

A final comment before moving on: you may be asked whether these techniques are more art than science, given that the learning is unsupervised. I think the clear answer is "it depends". This quote sums it up nicely:

 

"The major obstacle is the difficulty in evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We argue that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use."

 
 --Luxburg et al. (2012)

Hierarchical clustering

The hierarchical clustering algorithm is based on a dissimilarity measure between observations. A common measure, and the one we will use, is Euclidean distance, but other distance measures are available.

Hierarchical clustering is an agglomerative, or bottom-up, technique. By this, we mean that all observations start out as their own cluster. From there, the algorithm proceeds iteratively by searching all the pairwise dissimilarities and merging the two clusters that are most similar. So, after the first iteration, there are n-1 clusters; after the second iteration, there are n-2 clusters; and so forth.

As the iterations continue, it is important to understand that in addition to the distance measure, we need to specify the linkage between the groups of observations. Different types of datasets will demand that you use different cluster linkages. As you experiment with the linkages, you may find that some may create highly unbalanced numbers of observations in one or more clusters. For example, if you have 30 observations, one technique may create a cluster of just one observation, regardless of how many total clusters that you specify. In this situation, your judgment will likely be needed to select the most appropriate linkage as it relates to the data and business case.

The following are the common types of linkage, but note that there are others:

  • Ward: Minimizes the total within-cluster variance, as measured by the sum of squared errors from the points in a cluster to its centroid
  • Complete: The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster
  • Single: The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster
  • Average: The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster
  • Centroid: The distance between two clusters is the distance between the two cluster centroids
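To make this concrete, here is a minimal sketch of how you might compare linkages in R with the base hclust() function. The mtcars data frame is only a stand-in for your own data, and three clusters is an arbitrary choice for illustration:

    d <- dist(scale(mtcars))  # Euclidean distances on standardized data

    # Fit hierarchical clustering with several linkage methods
    hc_ward     <- hclust(d, method = "ward.D2")
    hc_complete <- hclust(d, method = "complete")
    hc_single   <- hclust(d, method = "single")
    hc_average  <- hclust(d, method = "average")

    # Check how balanced the cluster sizes are if we cut each tree at three clusters
    table(cutree(hc_complete, k = 3))
    table(cutree(hc_single, k = 3))  # single linkage often produces one tiny cluster

If one linkage leaves a cluster with only one or two observations, that is exactly the kind of imbalance described above, and another linkage or a different number of clusters may serve the business question better.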

The output of hierarchical clustering will be a dendrogram, which is a tree-like diagram that shows the arrangement of the various clusters.

As we will see, it can often be difficult to identify a clear-cut breakpoint in the selection of the number of clusters. Your decision should be iterative in nature and focused on the context of the business decision.
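As a sketch of that workflow, plotting the tree and then cutting it at a candidate number of clusters might look like the following; hc_ward is the Ward-linkage object from the earlier sketch, and three clusters is again just an illustrative choice:

    plot(hc_ward, hang = -1, labels = FALSE, main = "Ward linkage dendrogram")
    rect.hclust(hc_ward, k = 3, border = "red")  # outline a three-cluster solution

    clusters <- cutree(hc_ward, k = 3)  # assign each observation to a cluster
    table(clusters)                     # inspect the cluster sizes

Try several values of k and compare the groupings against what makes sense for the business question, rather than searching for a single statistically "correct" answer.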

Distance calculations

As mentioned previously, Euclidean distance is commonly used to build the input for hierarchical clustering. Let's look at a simple example of how to calculate it with two observations and two variables/features.

Let's say that observation A costs $5.00 and weighs 3 pounds. Further, observation B costs $3.00 and weighs 5 pounds. We can place these values in the distance formula: distance between A and B is equal to the square root of the sum of the squared differences, which in our example would be as follows:

Distance(A, B) = √((5 − 3)² + (3 − 5)²) = √(4 + 4) = √8 ≈ 2.83

The value of 2.83 is not meaningful in and of itself, but it is important in the context of the other pairwise distances. Euclidean distance is the default for the dist() function in R, and you can specify other distance calculations (maximum, manhattan, canberra, binary, and minkowski) in the function. We will not go into detail on why or when you would choose these over Euclidean distance, as that choice can be rather domain-specific. For example, Euclidean distance may be inadequate when your data suffers from high dimensionality, such as in a genomic study. It will take domain knowledge and/or trial and error on your part to determine the proper distance measure. One final note: scale your data so that each variable has a mean of zero and a standard deviation of one, which makes the distance calculations comparable. Otherwise, any variable on a larger scale will have a larger effect on the distances.
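As a minimal sketch, the same calculation can be reproduced with R's dist() function; the observation names and the commented scale() call on a hypothetical data frame df are only illustrative:

    # Two observations measured on price (dollars) and weight (pounds)
    x <- rbind(A = c(price = 5, weight = 3),
               B = c(price = 3, weight = 5))

    dist(x)                        # Euclidean by default; returns 2.828427
    dist(x, method = "manhattan")  # one of the alternative distance measures

    # On real data, standardize first so that no variable dominates the distances
    # d <- dist(scale(df))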
