The hierarchical clustering algorithm is based on a dissimilarity measure between observations. A common measure, and the one we will use, is Euclidean distance, although other distance measures are available.
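To make the dissimilarity measure concrete, here is a minimal sketch using SciPy's `pdist` to compute pairwise distances on a few made-up observations; the data values are illustrative only, and `cityblock` (Manhattan) stands in for "other distance measures":

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative 2-D observations (hypothetical data)
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise dissimilarities as a square matrix
d_euclidean = squareform(pdist(X, metric="euclidean"))
d_manhattan = squareform(pdist(X, metric="cityblock"))  # one alternative measure

print(d_euclidean[0, 1])  # Euclidean distance between the first two observations -> 5.0
print(d_manhattan[0, 1])  # Manhattan distance between the same pair -> 7.0
```

The condensed output of `pdist` is also exactly what SciPy's hierarchical clustering routines accept as input.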
In addition to the distance measure, we need to specify the linkage, which determines how the dissimilarity between groups of observations is computed as the algorithm iteratively merges clusters. Different types of data call for different linkages, and as you experiment you may find that some produce highly unbalanced clusters. For example, with 30 observations, one linkage may place a single observation in its own cluster, regardless of how many total clusters you specify. In that situation, use your judgment to select the linkage most appropriate to the data and the business case.
The following table lists the common linkage types, but note that there are others:
| Linkage | Description |
| --- | --- |
| Ward | Minimizes the total within-cluster variance, as measured by the sum of squared distances from the observations in a cluster to its centroid |
| Complete | The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster |
| Single | The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster |
| Average | The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster |
| Centroid | The distance between two clusters is the distance between the cluster centroids |
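The imbalance issue described above can be seen directly by running the same data through each linkage. This is a sketch using SciPy's `linkage` and `fcluster`; the synthetic data (two loose groups plus one far-away point) is an assumption chosen to make the effect visible:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# 30 synthetic observations: two loose groups plus one outlier
X = np.vstack([
    rng.normal(0, 1, size=(15, 2)),   # group 1
    rng.normal(6, 1, size=(14, 2)),   # group 2
    [[25.0, 25.0]],                   # a single far-away observation
])

cluster_sizes = {}
for method in ["ward", "complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    cluster_sizes[method] = sorted(np.bincount(labels)[1:].tolist())
    print(f"{method:9s} cluster sizes: {cluster_sizes[method]}")
```

Comparing the printed cluster sizes across methods shows how a linkage such as single can isolate the outlier in a cluster of one, exactly the unbalanced outcome the text warns about.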
As we will see, it can often be difficult to identify a clear-cut breakpoint when selecting the number of clusters. Once again, the decision should be iterative in nature and grounded in the context of the business problem.
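One common heuristic for finding a breakpoint is to look for a large jump in the merge heights of the dendrogram: merges before the jump combine genuinely similar groups, while the merge at the jump forces dissimilar groups together. A sketch of that idea, on illustrative well-separated data (the group centers and spread are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# Illustrative data: three well-separated groups of 10 observations each
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0, 10, 20)])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # merge distances, in ascending order
gaps = np.diff(heights)    # jumps between successive merges

# After merge i (0-indexed), n - (i + 1) clusters remain; the largest
# jump suggests stopping just before it
i = int(np.argmax(gaps))
suggested_k = X.shape[0] - (i + 1)
print("suggested number of clusters:", suggested_k)
```

On messier real data the gaps are rarely this clean, which is why the choice ultimately remains a judgment call rather than a purely mechanical one.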