
Unsupervised machine learning for discovery: workflow and best practices

Presented During: Machine Learning for Spatiotemporal Data

Tarek Zikry Co-Author

Tiffany Tang Co-Author
University of Notre Dame

Genevera Allen Co-Author

Andersen Chang First Author

Tarek Zikry Presenting Author

Monday, Aug 4: 11:35 AM - 11:50 AM
2285
Contributed Papers

Music City Center

Unsupervised learning is increasingly used to mine large datasets for discoveries in critical domains such as biomedicine and national security. However, standardized methodologies for ensuring that these results are reliable and interpretable are lacking. Here, we present a structured workflow for applying unsupervised learning, illustrated through an in-depth case study. We examine the classification of Milky Way stars in the APOGEE survey, applying unsupervised techniques to distinguish stellar populations and identify common origins of chemical formations. Through this example, we provide guidance on data preprocessing, feature engineering, exploratory analysis, dimension reduction, validation, and iterative communication with domain experts to ensure meaningful insights. By integrating best practices in statistical analysis with real-world applications, we demonstrate how a generalizable workflow for unsupervised learning can facilitate robust data-driven discovery.
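As a rough illustration of how several of the workflow stages named above (preprocessing, dimension reduction, clustering, validation) might fit together, the following is a minimal sketch and not the authors' method. The abstract does not name specific algorithms, so everything here is an assumption: synthetic two-group data stand in for the APOGEE measurements, PCA and k-means stand in for the dimension reduction and clustering steps, and validation is approximated by checking how well cluster assignments agree when the data are randomly subsampled. All function names are hypothetical.

```python
import numpy as np

def standardize(X):
    """Preprocessing: center each feature and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_reduce(X, n_components):
    """Dimension reduction: project onto leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, rng, n_iter=50):
    """Clustering: plain Lloyd's algorithm; returns one label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree.

    Invariant to how the cluster labels themselves are numbered.
    """
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())

def stability(X, k, rng, n_rounds=10, frac=0.8):
    """Validation: mean Rand index between the full-data clustering and
    clusterings refit on random subsamples (assumed validation scheme)."""
    full = kmeans(X, k, rng)
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = kmeans(X[idx], k, rng)
        scores.append(rand_index(full[idx], sub))
    return float(np.mean(scores))

# Synthetic stand-in data (NOT APOGEE): two well-separated groups, 5 features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),
    rng.normal(6.0, 1.0, size=(100, 5)),
])

Z = pca_reduce(standardize(X), n_components=2)
score = stability(Z, k=2, rng=rng)
print(f"subsample stability (mean Rand index): {score:.3f}")
```

A stability score near 1 suggests the clusters persist under resampling; a low score would be a cue to revisit preprocessing choices or the number of clusters, ideally together with domain experts, before interpreting the groups.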

Keywords

unsupervised learning

workflow

validation

clustering

dimension reduction

statistical learning

Main Sponsor

Section on Statistical Learning and Data Science