Clav: R package and Shiny application for cluster analysis validation

Jason Bryer, First Author and Presenting Author
Tuesday, Aug 5: 10:35 AM - 10:50 AM
1264 
Contributed Papers 
Music City Center 
Cluster analysis is a statistical procedure for grouping observations using an observation-centered approach, in contrast to variable-centered approaches (e.g., PCA, factor analysis). Whether it serves as a preprocessing step for predictive modeling or as the primary analysis, validation is critical for determining whether a cluster solution generalizes across datasets. Theodoridis and Koutroumbas (2008) identified three broad types of validation for cluster analysis: 1) internal cluster validation, 2) relative cluster validation, and 3) external cluster validation. Strategies for types 1 and 2 are well established. External validation is harder, however, because cluster analysis is typically an unsupervised learning method with no observed outcome to validate against. Ullman et al. (2021) proposed validating a cluster solution by visually inspecting the cluster solutions across a training and a validation dataset. This talk introduces the clav R package, which implements and extends this approach by generating multiple random samples (using either simple random splits or bootstrap samples). The package provides visualizations of both the cluster profiles and the distributions of the cluster means, along with a Shiny application to assist the researcher.
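
The repeated-split idea described above can be illustrated with a minimal sketch. This is not the clav package or its API: it is written in standard-library Python, and the function names (`nearest`, `profile`, `kmeans`, `validate`), the use of k-means, the 50/50 split, and the simulated data are all assumptions made for illustration. It covers only the simple random split, not the bootstrap variant.

```python
import random
from statistics import mean

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return dists.index(min(dists))

def profile(points, labels, k, fallback):
    """Per-cluster mean of each variable; an empty cluster keeps its fallback center."""
    out = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        out.append([mean(col) for col in zip(*members)] if members else list(fallback[j]))
    return out

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm (an assumed stand-in for any clustering method).
    Centers are sorted so that cluster labels line up across repeated runs."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        new = profile(points, labels, k, centers)
        if new == centers:  # assignments stopped changing
            break
        centers = new
    return sorted(centers)

def validate(points, k, n_samples=20, seed=1):
    """Repeat a random 50/50 split and return one (training profile,
    validation profile) pair per sample."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        shuffled = rng.sample(points, len(points))
        half = len(shuffled) // 2
        train, valid = shuffled[:half], shuffled[half:]
        centers = kmeans(train, k, rng)                   # fit on training half
        labels = [nearest(p, centers) for p in valid]     # assign validation half
        results.append((centers, profile(valid, labels, k, centers)))
    return results

# Simulated two-variable data with two well-separated groups (illustrative only).
demo_rng = random.Random(0)
data = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(100)]
data += [[demo_rng.gauss(6, 1), demo_rng.gauss(6, 1)] for _ in range(100)]

runs = validate(data, k=2)
for train_profile, valid_profile in runs[:3]:
    print("training:", train_profile, "validation:", valid_profile)
```

If the cluster solution generalizes, the training and validation profiles should stay close to each other in every sample, and the cluster means should vary little across samples. Sorting the centers is a crude way to keep cluster labels aligned between runs and is adequate only for well-separated clusters like those simulated here.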

Keywords

cluster analysis

validation

R package

Shiny application 

Main Sponsor

Section on Statistical Graphics