Unsupervised machine learning for discovery: workflow and best practices

Tarek Zikry Co-Author
 
Tiffany Tang Co-Author
University of Notre Dame
 
Genevera Allen Co-Author
 
Andersen Chang First Author
 
Tarek Zikry Presenting Author
 
Monday, Aug 4: 11:35 AM - 11:50 AM
2285 
Contributed Papers 
Music City Center 
Unsupervised learning is increasingly used to mine large datasets and make discoveries in critical domains such as biomedicine and national security. However, standardized methodologies for ensuring that these results are reliable and interpretable are lacking. Here, we present a structured workflow for applying unsupervised learning, illustrated through an in-depth case study. We examine the classification of Milky Way stars in the APOGEE survey, applying unsupervised techniques to distinguish stellar populations and find common origins of chemical formations. Through this example, we provide guidance on data preprocessing, feature engineering, exploratory analysis, dimension reduction, validation, and iterative communication with domain experts to ensure meaningful insights. By integrating best practices in statistical analysis with real-world applications, we demonstrate how a generalizable workflow for unsupervised learning can facilitate robust data-driven discovery.
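As a rough illustration of the preprocessing, dimension reduction, clustering, and validation steps named in the abstract, the following is a minimal sketch and not the authors' pipeline. It assumes scikit-learn and NumPy are available. The data are synthetic stand-ins for chemical abundance features, not APOGEE measurements. The function names, the choice of PCA and k-means, and the subsampling-based stability check are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def make_synthetic_abundances(n_per_group=100, n_features=6, seed=0):
    """Synthetic stand-in data (NOT APOGEE): three well-separated groups."""
    rng = np.random.default_rng(seed)
    centers = 8.0 * np.eye(3, n_features)  # hypothetical group centers
    groups = [c + rng.normal(size=(n_per_group, n_features)) for c in centers]
    return np.vstack(groups)


def cluster_pipeline(X, n_clusters=3, n_components=2, seed=0):
    """Standardize, reduce dimension with PCA, then cluster with k-means."""
    Z = StandardScaler().fit_transform(X)
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(Z)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(Z)


def stability_score(X, n_clusters=3, n_repeats=10, frac=0.8, seed=0):
    """Mean adjusted Rand index between the full-data clustering and
    clusterings refit on random subsamples (one simple stability check)."""
    rng = np.random.default_rng(seed)
    full_labels = cluster_pipeline(X, n_clusters=n_clusters, seed=seed)
    n = X.shape[0]
    scores = []
    for r in range(n_repeats):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        sub_labels = cluster_pipeline(
            X[idx], n_clusters=n_clusters, seed=seed + r + 1
        )
        # Compare labels only on the points present in the subsample.
        scores.append(adjusted_rand_score(full_labels[idx], sub_labels))
    return float(np.mean(scores))


if __name__ == "__main__":
    X = make_synthetic_abundances()
    labels = cluster_pipeline(X)
    print("cluster sizes:", np.bincount(labels))
    print("stability (mean ARI):", stability_score(X))
```

A stability score near 1 suggests the clusters persist under resampling. Lower values would be a cue to revisit the preprocessing, the number of clusters, or the method itself before bringing the results to domain experts.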

Keywords

unsupervised learning

workflow

validation

clustering

dimension reduction

statistical learning 

Main Sponsor

Section on Statistical Learning and Data Science