Loss-Based Bayesian Clustering for Big Data using Splinters
David Dahl
Presenting Author
Brigham Young University
Wednesday, Aug 6: 11:50 AM - 12:05 PM
2055
Contributed Papers
Music City Center
We propose a Bayesian method to cluster large datasets for which obtaining samples from the full posterior distribution is impractical. In Bayesian inference, an estimator is chosen by introducing a loss function and reporting the Bayes rule that minimizes its posterior expectation. Except in trivially small cases, this expectation must be approximated, typically using posterior samples. However, standard posterior sampling algorithms scale poorly, making it difficult to fit models with tens of thousands of items. We address this "big data" setting, where posterior sampling over the full dataset is infeasible, by splitting the data into overlapping subsets of a size manageable for existing MCMC algorithms. The model is fit to each subset independently, generating several sets of posterior samples. Our goal is to use these samples to estimate a partition that approximates the one minimizing the posterior expected loss under the full model. The subset size, the number of subsets, and the degree of overlap are key tuning parameters, which we explore.
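To make the loss-based estimation concrete, the following is a minimal Python sketch, not the authors' algorithm: it only illustrates (a) forming overlapping subsets of item indices and (b) scoring a candidate partition by its posterior expected Binder loss computed from posterior samples of cluster labels. The function names, the random subset construction, and the choice to search only among sampled partitions are illustrative assumptions; the recombination of subset-level samples into a full-data estimate, which is the core of the proposed method, is not shown.

```python
# Illustrative sketch only (assumed helpers, not the authors' method).
import numpy as np

def overlapping_subsets(n_items, subset_size, overlap, rng):
    """Randomly split 0..n_items-1 into blocks of roughly `subset_size`,
    then pad each block with `overlap` items drawn from the other blocks."""
    perm = rng.permutation(n_items)
    blocks = [perm[i:i + subset_size] for i in range(0, n_items, subset_size)]
    subsets = []
    for b in blocks:
        others = np.setdiff1d(perm, b)
        extra = rng.choice(others, size=min(overlap, others.size), replace=False)
        subsets.append(np.concatenate([b, extra]))
    return subsets

def coclustering_probabilities(label_samples):
    """Posterior co-clustering probabilities p_ij from an (S, n) array of
    sampled cluster-label vectors."""
    S, n = label_samples.shape
    p = np.zeros((n, n))
    for z in label_samples:
        p += (z[:, None] == z[None, :])
    return p / S

def expected_binder_loss(candidate, p):
    """Posterior expected Binder loss (up to a constant) of a candidate
    label vector, given co-clustering probabilities p."""
    same = candidate[:, None] == candidate[None, :]
    pairwise = np.where(same, 1.0 - p, p)
    iu = np.triu_indices_from(pairwise, k=1)
    return pairwise[iu].sum()

# Toy usage: pick the sampled partition minimizing expected Binder loss.
rng = np.random.default_rng(0)
samples = rng.integers(0, 3, size=(200, 50))   # placeholder posterior label samples
p = coclustering_probabilities(samples)
best = min(samples, key=lambda z: expected_binder_loss(z, p))
subsets = overlapping_subsets(n_items=50, subset_size=20, overlap=5, rng=rng)
```

The same scoring idea applies with other losses (e.g., variation of information); only `expected_binder_loss` would change.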
Keywords: Bayesian clustering, Decision theory, Variation of information loss, Binder loss, Big data
Main Sponsor
Section on Bayesian Statistical Science