Statistics for Clustering

The Describe Clusters dialog box provides information about the models that Tableau computed for clustering. You can use these statistics to assess the quality of the clustering.

When the view includes clustering, you can open the Describe Clusters dialog box by right-clicking Clusters on the Marks card (Control-clicking on a Mac) and choosing Describe Clusters. The information in the Describe Clusters dialog box is read-only, though you can click Copy to Clipboard and then paste the screen contents into a writeable document.

The Describe Clusters dialog box has two tabs: a Summary tab and a Models tab.

Describing Clusters – Summary tab

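The Summary tab reports statistics that characterize the clustering as a whole, followed by statistics for each individual cluster. These are described below: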

Number of Clusters

The number of individual clusters in the clustering.

Number of Points

The number of marks in the view.

Between-group sum of squares

A metric quantifying the separation between clusters: the sum of squared distances between each cluster's center (average value) and the center of the data set, with each squared distance weighted by the number of data points assigned to the cluster. The larger the value, the better the separation between clusters.

Within-group sum of squares

A metric quantifying the cohesion of clusters as a sum of squared distances between the center of each cluster and the individual marks in the cluster. The smaller the value, the more cohesive the clusters.

Total sum of squares

Totals the between-group sum of squares and the within-group sum of squares. The ratio (between-group sum of squares)/(total sum of squares) gives the proportion of variance explained by the model.
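
The three sums of squares can be reproduced by hand for a small data set. The following is a minimal sketch in plain Python, assuming Euclidean distance on numeric variables that have already been scaled. The function name `sum_of_squares` and the sample points are hypothetical and are not part of Tableau.

```python
from collections import defaultdict


def sum_of_squares(points, labels):
    """Return (between, within, total) sums of squares for labelled points.

    points: list of equal-length numeric tuples, one per mark
    labels: list of cluster labels, one per point
    """
    dims = len(points[0])

    def centroid(rows):
        # Average value of each variable across the given rows.
        return tuple(sum(r[d] for r in rows) / len(rows) for d in range(dims))

    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    grand = centroid(points)  # center of the whole data set
    between = 0.0
    within = 0.0
    for rows in clusters.values():
        center = centroid(rows)
        # Separation: cluster center vs. data set center, weighted by size.
        between += len(rows) * sq_dist(center, grand)
        # Cohesion: each mark vs. its own cluster center.
        within += sum(sq_dist(r, center) for r in rows)
    return between, within, between + within


# Hypothetical sample: two tight, well-separated clusters.
points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels = [1, 1, 2, 2]
between, within, total = sum_of_squares(points, labels)
explained = between / total  # proportion of variance explained
```

For these four points the between-group sum of squares is 98, the within-group sum of squares is 2, and the total is 100, so the clustering explains 98% of the variance.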

Cluster Statistics

For each cluster in the clustering, the following information is provided.

# Items

The number of marks within the cluster.

Centers

The average value within each cluster (shown for numeric items).

Most Common

The most common value within each cluster (only shown for categorical items).
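
The per-cluster statistics are simple group summaries. The sketch below shows one way to compute them in plain Python, assuming each mark is a dictionary of field values and that cluster labels are already known. The function name `cluster_statistics` and the field names `sales`, `profit`, and `region` are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def cluster_statistics(rows, labels, numeric_fields, categorical_fields):
    """Summarise each cluster: item count, centers, and most common values."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    summary = {}
    for label, members in grouped.items():
        summary[label] = {
            # "# Items": number of marks in the cluster.
            "items": len(members),
            # "Centers": average of each numeric field.
            "centers": {
                f: mean(m[f] for m in members) for f in numeric_fields
            },
            # "Most Common": modal value of each categorical field.
            "most_common": {
                f: Counter(m[f] for m in members).most_common(1)[0][0]
                for f in categorical_fields
            },
        }
    return summary


# Hypothetical marks and cluster assignments.
rows = [
    {"sales": 10.0, "profit": 2.0, "region": "East"},
    {"sales": 14.0, "profit": 4.0, "region": "East"},
    {"sales": 90.0, "profit": 30.0, "region": "West"},
    {"sales": 110.0, "profit": 20.0, "region": "West"},
    {"sales": 100.0, "profit": 25.0, "region": "South"},
]
labels = [1, 1, 2, 2, 2]
stats = cluster_statistics(rows, labels, ["sales", "profit"], ["region"])
```

Here cluster 1 holds two marks centered at sales 12 and profit 3 with "East" as the most common region, and cluster 2 holds three marks centered at sales 100 and profit 25 with "West" as the most common region.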

Testing your Clustering

Since clustering models are unsupervised, they can be harder to evaluate than supervised models: there are no known correct labels to compare the results against. The clusters are created by the modeling procedure, and it's not immediately obvious how the clusters were generated.

Evaluation is a matter of checking observable summaries about the clustering. There are some key metrics that need to be taken into consideration, and they are discussed next.

Describing Clusters – Models tab

Analysis of variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters. In this case, analysis of variance is computed per variable, and the resulting analysis of variance table can be used to determine which variables are most effective for distinguishing clusters.

Relevant analysis of variance statistics for Tableau clustering include:

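  • F-statistic: The F-statistic for one-way, or single-factor, ANOVA measures how much of a variable's variance is explained by the clusters. It is the ratio of the between-group variance (the between-group sum of squares divided by its degrees of freedom) to the within-group variance (the within-group sum of squares divided by its degrees of freedom).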

    The larger the F-statistic, the better the corresponding variable is distinguishing between clusters.

  • p-value: The p-value is the probability that a value drawn from the F-distribution of all possible values of the F-statistic is greater than the actual F-statistic for a variable. The degrees of freedom for this F-distribution are (k - 1, N - k), where k is the number of clusters and N is the number of items (rows) clustered. If the p-value falls below a specified significance level, then the null hypothesis (that the individual elements of the variable are random samples from a single population) can be rejected.

    The lower the p-value, the more the expected values of the elements of the corresponding variable differ among clusters.

  • Model Sum of Squares and Degrees of Freedom: The Model Sum of Squares is the ratio of the between-group sum of squares to the model degrees of freedom. The between-group sum of squares is a measure of the variation between cluster means. If the cluster means are close to each other (and therefore close to the overall mean), this value will be small. The model has k - 1 degrees of freedom, where k is the number of clusters.
  • Error Sum of Squares and Degrees of Freedom: The Error Sum of Squares is the ratio of the within-group sum of squares to the error degrees of freedom. The within-group sum of squares measures the variation between observations within each cluster. The error has N - k degrees of freedom, where N is the total number of observations (rows) clustered and k is the number of clusters.

    The Error Sum of Squares can be thought of as the overall Mean Square Error, assuming that each cluster center represents the "truth" for each cluster.
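
The quantities in the analysis of variance table can be reproduced for a single variable. The sketch below is a minimal plain-Python illustration of textbook one-way ANOVA, in which the F-statistic is the model mean square divided by the error mean square. That Tableau computes its figures in exactly this way is an assumption. The function name `one_way_anova` and the sample values are hypothetical. The p-value is left out because the Python standard library has no F-distribution.

```python
from collections import defaultdict


def one_way_anova(values, labels):
    """Return one-way ANOVA quantities for one variable grouped by cluster."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n = len(values)   # N: number of items (rows) clustered
    k = len(groups)   # k: number of clusters
    grand_mean = sum(values) / n

    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        # Variation of the cluster mean around the overall mean.
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        # Variation of observations around their own cluster mean.
        ss_within += sum((v - group_mean) ** 2 for v in members)

    df_model = k - 1
    df_error = n - k
    ms_model = ss_between / df_model  # between-group variance
    ms_error = ss_within / df_error   # within-group variance
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_model": df_model,
        "df_error": df_error,
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f_statistic": ms_model / ms_error,
    }


# Hypothetical variable with two clearly separated clusters.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [1, 1, 1, 2, 2, 2]
result = one_way_anova(values, labels)
```

With these values the between-group and within-group sums of squares are 54 and 4, the degrees of freedom are 1 and 4, and the F-statistic is 54. If SciPy is available, the matching p-value could be obtained with `scipy.stats.f.sf(result["f_statistic"], result["df_model"], result["df_error"])`.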
