Generalization of two well-known indices in the choice of the number of clusters within an Agglomerative Hierarchical Clustering
Conference: Symposium on Data Science and Statistics (SDSS) 2023
05/24/2023: 4:15 PM - 4:20 PM CDT
Lightning
Agglomerative Hierarchical Clustering (AHC) is a widely used statistical method for grouping objects into homogeneous clusters. One of its advantages is that its successive merging steps can be represented graphically as a dendrogram.
In such a cluster analysis, the choice of the number of clusters is crucial. In practice, statistical experts often use the dendrogram to determine it: a large gap between successive merge levels in the dendrogram indicates that the two clusters being aggregated are heterogeneous, whereas a small gap indicates that they are close.
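The graphical reading of the dendrogram described above can be sketched in a few lines. This is only an illustration of the "largest gap" heuristic, not the procedure used by the authors or by XLSTAT; the function name `clusters_from_largest_gap` and the input convention (merge heights listed in merge order, from the first merge to the last) are assumptions made for the example.

```python
def clusters_from_largest_gap(heights):
    """Suggest a number of clusters from the largest jump in merge heights.

    `heights` holds the dendrogram heights of the n - 1 merges, in merge
    order (assumed non-decreasing). After m merges, n - m clusters remain.
    """
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare gaps")
    n = len(heights) + 1  # number of objects
    # gaps[i] is the jump between merge i + 1 and merge i + 2
    gaps = [heights[m] - heights[m - 1] for m in range(1, len(heights))]
    # Stop just before the merge that causes the largest jump
    merges_done = 1 + max(range(len(gaps)), key=gaps.__getitem__)
    return n - merges_done


# Hypothetical heights for 6 objects: the fourth merge is much more costly
print(clusters_from_largest_gap([1.0, 2.0, 3.0, 10.0, 11.0]))
```

In this hypothetical example the jump from 3.0 to 10.0 is the largest, so the sketch suggests stopping after three merges.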
Much has been written to help expert and non-expert users choose the correct number of clusters, and many indices have been proposed (Charrad et al., 2014). However, the vast majority of these indices were constructed for the most classical case: Euclidean distance and Ward's criterion. Although they perform well in this case, they lose their justification when the distance or the aggregation method changes. Yet, depending on the type of data (numerical, categorical) and on personal preferences, a user may choose a distance such as Canberra or an aggregation method such as single linkage and still need guidance on the number of clusters. Therefore, we propose to generalise two well-known indices to any distance and aggregation method: the Hartigan index (Hartigan, 1975) and the Calinski-Harabasz index (Calinski & Harabasz, 1974). As we demonstrate, in the Euclidean/Ward's case these indices can be obtained directly from the dendrogram values. Moreover, they are related to the heterogeneity gap, which experts usually interpret graphically. Thanks to these properties, we show that the two indices can be generalised by using the dendrogram values directly, regardless of the distance and aggregation method chosen by the user.
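As a rough illustration of how both indices can be read off dendrogram values in the Euclidean/Ward's case, the sketch below uses the standard definitions CH(k) = [B_k / (k - 1)] / [W_k / (n - k)] and H(k) = (W_k / W_{k+1} - 1)(n - k - 1), where W_k is the within-cluster inertia for k clusters and B_k the between-cluster inertia. It assumes that each dendrogram value is the increase in within-cluster inertia caused by the corresponding merge, so that W_k is the sum of the first n - k values. The function name and input convention are hypothetical, and this is not the authors' generalised formulation, which the abstract does not spell out.

```python
from itertools import accumulate


def indices_from_merge_costs(costs):
    """Calinski-Harabasz and Hartigan indices from Ward merge costs.

    `costs[i]` is assumed to be the increase in within-cluster inertia
    caused by the (i + 1)-th merge, listed in merge order.
    Returns two dicts mapping a number of clusters k to the index value.
    """
    n = len(costs) + 1                      # number of objects
    cum = [0.0] + list(accumulate(costs))   # cum[m] = cost of first m merges
    total = cum[-1]                         # total inertia (one cluster)
    within = {k: cum[n - k] for k in range(1, n + 1)}  # W_k

    ch = {}
    for k in range(2, n):
        if within[k] > 0:
            between = total - within[k]     # B_k
            ch[k] = (between / (k - 1)) / (within[k] / (n - k))

    hartigan = {}
    for k in range(1, n - 1):
        if within[k + 1] > 0:
            # W_k - W_{k+1} is exactly the cost of the merge from k + 1 to k
            hartigan[k] = (within[k] / within[k + 1] - 1.0) * (n - k - 1)
    return ch, hartigan


# Four points on a line at 0, 1, 5, 6: Ward merges cost 0.5, 0.5, then 25
ch, hartigan = indices_from_merge_costs([0.5, 0.5, 25.0])
print(ch)        # CH is largest at k = 2
print(hartigan)  # Hartigan drops sharply after k = 1
```

Because W_k - W_{k+1} equals the cost of a single merge, the Hartigan index is a rescaled version of the gap that experts read on the dendrogram, which is the link to the heterogeneity gap mentioned above.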
Finally, the limitations of using the two raw indices outside the Euclidean/Ward context and the benefits of the proposed generalisation are illustrated with XLSTAT software.
Cluster analysis
Number of clusters
AHC
Hartigan
Calinski-Harabasz
Presenting Author
Fabien Llobell, Addinsoft, XLSTAT
First Author
Fabien Llobell, Addinsoft, XLSTAT
CoAuthor
Nour Selmi, Lumivero, XLSTAT, Paris, France
Target Audience
Mid-Level
Tracks
Computational Statistics
Symposium on Data Science and Statistics (SDSS) 2023