EMS Coreset: An Efficient Expectation-Maximization Algorithm for Sinkhorn Coreset

Haoyun Yin Speaker
 
Chuanhui Liu Co-Author
 
Xiao Wang Co-Author
Purdue University
 
Thursday, Aug 6: 8:30 AM - 10:20 AM
1880 
Contributed Papers 
Thomas M. Menino Convention & Exhibition Center 
Coresets distill large datasets into small, representative subsets for efficient downstream learning. Yet Optimal Transport (OT)–based selection typically requires intensive computation of transport plans, limiting scalability. We introduce a scalable Sinkhorn coreset method that permits closed-form updates of the entropically regularized OT coupling by allowing non-uniform coreset weights. This produces centroids that generalize k-means via soft assignments. We establish asymptotic consistency of the selected measure and Lipschitz stability to data perturbations, providing accuracy and robustness guarantees. Across synthetic and real-world benchmarks, the proposed method achieves competitive or improved approximation quality while substantially reducing runtime compared to Wasserstein- and standard Sinkhorn-based coreset selection, especially at large scale.
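
The abstract does not spell out the update rules, so the following is only a minimal sketch of one way a closed-form, soft-assignment update can arise, not the authors' algorithm. It assumes a squared Euclidean cost, uniform weights on the data, and an entropic penalty taken relative to the product of the data and coreset weights; under those assumptions, leaving the coreset weights free turns the coupling into a row-wise softmax, and the centroid step becomes a weighted mean, as in soft k-means. The function name soft_coreset and the parameters m, eps, and n_iter are hypothetical.

```python
import numpy as np

def soft_coreset(X, m, eps=1.0, n_iter=50, seed=0):
    """Illustrative EM-style coreset sketch with non-uniform weights.

    Assumptions (not taken from the abstract): squared Euclidean cost,
    uniform data weights 1/n, and coreset weights left free so that the
    coupling has a closed form.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = X[rng.choice(n, size=m, replace=False)].copy()  # initial centroids
    w = np.full(m, 1.0 / m)                             # initial coreset weights

    for _ in range(n_iter):
        # Pairwise squared distances between data points and centroids: (n, m)
        C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

        # E-step: closed-form coupling, a row-wise softmax of log(w) - C / eps
        logits = np.log(w)[None, :] - C / eps
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)               # soft assignments
        P = R / n                                       # each row sums to 1/n

        # M-step: weights are the column marginal; centroids are weighted means
        mass = np.maximum(P.sum(axis=0), 1e-12)         # guard against empty columns
        Y = (P.T @ X) / mass[:, None]
        w = mass / mass.sum()

    return Y, w, P
```

As eps shrinks, the soft assignments in this sketch approach hard nearest-centroid assignments and the centroid step approaches the usual k-means mean update, which is the sense in which such centroids generalize k-means.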

Keywords

Coreset

Optimal Transport

Data Distillation

Sinkhorn Loss

EM Algorithm

Main Sponsor

Section on Statistical Learning and Data Science