Generalization of two well-known indices in the choice of the number of clusters within an Agglomerative Hierarchical Clustering

Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/24/2023: 4:15 PM - 4:20 PM CDT
Lightning 

Description

Agglomerative Hierarchical Clustering (AHC) is a very popular statistical method that allows objects to be grouped into homogeneous clusters. This clustering method has the advantage of being able to represent its steps in a graphical form with a dendrogram.

In such a cluster analysis, the choice of the number of clusters is crucial. In practice, statistical experts often rely on the dendrogram to determine it. Indeed, a large gap between two successive aggregation levels in the dendrogram indicates that the two clusters being merged are heterogeneous, whereas a small gap implies that the two aggregated clusters are close.
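The graphical rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses SciPy's `linkage` function rather than XLSTAT, the helper name `clusters_by_largest_gap` and the toy data are hypothetical, and the rule implemented here (cut the dendrogram inside its largest gap) is one simple reading of what an expert does by eye, not the indices proposed in this talk.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def clusters_by_largest_gap(heights):
    """Suggest a number of clusters by cutting inside the largest gap
    between successive merge heights (hypothetical helper)."""
    heights = np.asarray(heights, dtype=float)
    n = heights.size + 1  # n objects produce n - 1 merges
    if n < 3:
        raise ValueError("at least 3 objects are needed to compare gaps")
    gaps = np.diff(heights)  # gaps[i] = heights[i + 1] - heights[i]
    i = int(np.argmax(gaps))
    # Cutting between heights[i] and heights[i + 1] keeps the first
    # i + 1 merges, which leaves n - (i + 1) clusters.
    return n - (i + 1)


# Toy data (an assumption for illustration): two well-separated groups.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

# Single linkage is used to show that the rule only needs merge heights.
Z = linkage(points, method="single")
k = clusters_by_largest_gap(Z[:, 2])  # column 2 holds the merge heights
print(k)
```

Because the rule only reads the merge heights, it can be applied to any distance and aggregation method whose heights are non-decreasing, which is the property the abstract builds on.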

Many indices have been proposed to help expert and non-expert users choose the number of clusters (Charrad et al., 2014). However, the vast majority of them were designed for the most classical case: Euclidean distance and Ward's criterion. Although they perform well in that case, they lose their relevance when the distance or the aggregation method changes. Yet, depending on the type of data (numerical, categorical) and on their preferences, users may choose a distance such as Canberra or an aggregation method such as single linkage, and they still need guidance on the number of clusters. We therefore propose to generalise two well-known indices to any distance and aggregation method: the Hartigan index (Hartigan, 1975) and the Calinski-Harabasz index (Calinski & Harabasz, 1974). We demonstrate that, in the Euclidean/Ward case, these indices can be obtained directly from the dendrogram values. Moreover, they are related to the heterogeneity gap that experts usually interpret graphically. Thanks to these properties, we show that the two indices can be generalised by using the dendrogram values directly, regardless of the distance and aggregation method chosen by the user.
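To make the Euclidean/Ward statement concrete, here is a minimal Python sketch that computes both indices from dendrogram heights alone. It covers only the classical case and does not reproduce the generalisation presented in the talk. It relies on assumptions that are not part of the abstract: SciPy's Ward convention, in which each squared merge height equals twice the increase in within-cluster sum of squares (other software, including XLSTAT, may scale heights differently), the textbook definitions CH(k) = [B_k / (k - 1)] / [W_k / (n - k)] and H(k) = (W_k / W_{k+1} - 1)(n - k - 1), and a hypothetical helper name `indices_from_ward_heights`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def indices_from_ward_heights(heights, n):
    """Calinski-Harabasz and Hartigan indices from SciPy Ward heights.

    Assumes SciPy's convention: height**2 / 2 is the increase in
    within-cluster sum of squares caused by the corresponding merge.
    Assumes no duplicate points (otherwise some W_k would be zero).
    """
    delta_w = np.asarray(heights, dtype=float) ** 2 / 2.0
    # cum[m] = within-cluster sum of squares after m merges,
    # that is, for the partition into n - m clusters.
    cum = np.concatenate(([0.0], np.cumsum(delta_w)))
    total = cum[-1]  # W_1, the total sum of squares

    def within(k):
        return cum[n - k]

    ch = {}
    for k in range(2, n):  # CH is defined for k = 2 .. n - 1
        w_k = within(k)
        ch[k] = ((total - w_k) / (k - 1)) / (w_k / (n - k))

    hartigan = {}
    for k in range(1, n - 1):  # H is defined for k = 1 .. n - 2
        hartigan[k] = (within(k) / within(k + 1) - 1.0) * (n - k - 1)

    return ch, hartigan


# Toy data (an assumption for illustration): two well-separated groups.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
Z = linkage(points, method="ward")  # Euclidean distance, Ward's criterion
ch, hartigan = indices_from_ward_heights(Z[:, 2], n=len(points))

best_k = max(ch, key=ch.get)  # usual rule: take the k maximising CH
print(best_k, ch[best_k])
```

No access to the raw coordinates is needed once the dendrogram is built, which illustrates why indices expressed through dendrogram values are a natural starting point for other distances and aggregation methods.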

Finally, the limitations of using the two raw indices outside the Euclidean/Ward context and the benefits of the proposed generalisation are illustrated with XLSTAT software.

Keywords

Cluster analysis

Number of clusters

AHC

Hartigan

Calinski-Harabasz 

Presenting Author

Fabien Llobell, Addinsoft, XLSTAT

First Author

Fabien Llobell, Addinsoft, XLSTAT

CoAuthor

Nour Selmi, Lumivero, XLSTAT, Paris, France

Target Audience

Mid-Level

Tracks

Computational Statistics
Symposium on Data Science and Statistics (SDSS) 2023