The Describe Clusters dialog box provides information about the models that Tableau computed for clustering. You can use these statistics to assess the quality of the clustering.
When the view includes clustering, you can open the Describe Clusters dialog box by right-clicking Clusters on the Marks card (Control-clicking on a Mac) and choosing Describe Clusters. The information in the Describe Clusters dialog box is read-only, though you can click Copy to Clipboard and then paste the screen contents into a writeable document.
The Describe Clusters dialog box has two tabs: a Summary tab and a Models tab.
The statistics shown on the Summary tab are described in the following table:
| Statistic | Description |
| --- | --- |
| Number of Clusters | The number of individual clusters in the clustering. |
| Number of Points | The number of marks in the view. |
| Between-group sum of squares | A metric quantifying the separation between clusters as a sum of squared distances between each cluster's center (average value), weighted by the number of data points assigned to the cluster, and the center of the data set. The larger the value, the better the separation between clusters. |
| Within-group sum of squares | A metric quantifying the cohesion of clusters as a sum of squared distances between the center of each cluster and the individual marks in the cluster. The smaller the value, the more cohesive the clusters. |
| Total sum of squares | Totals the between-group sum of squares and the within-group sum of squares. The ratio (between-group sum of squares)/(total sum of squares) gives the proportion of variance explained by the model. |
| Cluster Statistics | For each cluster in the clustering, the following information is provided. |
| # Items | The number of marks within the cluster. |
| Centers | The average value within each cluster (shown for numeric items). |
| Most Common | The most common value within each cluster (only shown for categorical items). |
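The sum-of-squares statistics above can be reproduced by hand on a small data set. The following is a minimal sketch in Python, assuming raw (unscaled) numeric values and squared Euclidean distance. The function names, the sample points, and the cluster labels are hypothetical, and any preprocessing that Tableau applies before clustering is not reproduced here.

```python
from statistics import fmean


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Average value of each coordinate across the given points."""
    return tuple(fmean(column) for column in zip(*points))


def sum_of_squares(points, labels):
    """Return (between, within, total) sums of squares for a labeled clustering."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = centroid(members)
        # Separation: center-to-overall distance, weighted by cluster size.
        between += len(members) * sq_dist(center, overall)
        # Cohesion: distance from each mark to its own cluster center.
        within += sum(sq_dist(p, center) for p in members)
    return between, within, between + within


# Hypothetical marks and cluster assignments, for illustration only.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (8.5, 7.5), (7.8, 8.3)]
labels = [0, 0, 0, 1, 1, 1]

between, within, total = sum_of_squares(points, labels)
explained = between / total  # proportion of variance explained by the model
print(f"between={between:.3f} within={within:.3f} total={total:.3f}")
print(f"proportion explained={explained:.4f}")
```

With two well-separated groups like these, nearly all of the total sum of squares is between-group, so the ratio is close to 1. Overlapping clusters would push it toward 0.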
Because clustering models are unsupervised, there are no known labels to compare the results against, which makes them harder to evaluate. The clusters are created by the modeling procedure, and it's not immediately obvious how the clusters were generated.
Evaluation is a matter of checking observable summaries about the clustering. There are some key metrics that need to be taken into consideration, and they are discussed next.
Analysis of variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters. In this case, analysis of variance is computed per variable, and the resulting analysis of variance table can be used to determine which variables are most effective for distinguishing clusters.
Relevant analysis of variance statistics for Tableau clustering include:
The larger the F-statistic, the better the corresponding variable distinguishes between clusters.
The lower the p-value, the more the expected values of the elements of the corresponding variable differ among clusters.
The Error Sum of Squares can be thought of as the overall Mean Square Error, assuming that each cluster center represents the "truth" for each cluster.
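To make the per-variable analysis concrete, the sketch below computes a textbook one-way ANOVA for a single variable: the model and error sums of squares, their degrees of freedom, and the F-statistic. It uses the standard definition (between-group mean square divided by within-group mean square), which is an assumption and may not match Tableau's computation exactly. The function name, the variable names, and the sample values are hypothetical. The p-value is omitted because it requires the upper tail of the F-distribution with the two degrees of freedom, which the Python standard library does not provide.

```python
from statistics import fmean


def one_way_anova(values, labels):
    """Textbook one-way ANOVA for one variable, grouped by cluster label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = fmean(values)

    # Model (between-cluster) sum of squares and degrees of freedom.
    model_ss = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    model_df = k - 1

    # Error (within-cluster) sum of squares and degrees of freedom.
    error_ss = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    error_df = n - k

    # Assumes error_ss > 0; a variable that is constant within every cluster
    # would need separate handling.
    f_stat = (model_ss / model_df) / (error_ss / error_df)
    return {
        "model_ss": model_ss,
        "model_df": model_df,
        "error_ss": error_ss,
        "error_df": error_df,
        "f_stat": f_stat,
    }


# Hypothetical data: six marks in two clusters, two candidate variables.
labels = [0, 0, 0, 1, 1, 1]
income = [20, 22, 21, 50, 52, 51]  # differs sharply between clusters
age = [30, 40, 35, 34, 31, 41]     # barely differs between clusters

for name, values in (("income", income), ("age", age)):
    result = one_way_anova(values, labels)
    print(name, result)
```

In this example, `income` yields a much larger F-statistic than `age`, which marks it as the variable that better distinguishes the two clusters.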