Loss-Based Bayesian Clustering for Big Data using Splinters
David Dahl
Presenting Author
Brigham Young University
Wednesday, Aug 6: 11:50 AM - 12:05 PM
2055
Contributed Papers
Music City Center
We propose a Bayesian method to cluster large datasets for which obtaining samples from the full posterior distribution is impractical. In Bayesian inference, an estimator is chosen by introducing a loss function and reporting the Bayes rule that minimizes its posterior expectation. Except in trivially small cases, this expectation must be approximated, typically using posterior samples. However, standard posterior sampling algorithms scale poorly, making it difficult to fit models with tens of thousands of items. We address this "big data" setting, where posterior sampling over the full dataset is infeasible, by splitting the data into overlapping subsets of a size manageable for existing MCMC algorithms. The model is fit to each subset independently, generating several sets of posterior samples. Our goal is to use these samples to estimate a partition that approximates the one minimizing the posterior expected loss under the full model. The subset size, the number of subsets, and the degree of overlap are key tuning parameters, which we explore.
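To make the loss-based estimation concrete, the following is a minimal Python sketch, not the authors' algorithm: it only illustrates (a) forming overlapping subsets of item indices and (b) scoring a candidate partition by its posterior expected Binder loss computed from posterior samples of cluster labels. The function names, the random subset construction, and the choice to search only among sampled partitions are illustrative assumptions; the recombination of subset-level samples into a full-data estimate, which is the core of the proposed method, is not shown.

```python
# Illustrative sketch only (assumed helpers, not the authors' method).
import numpy as np

def overlapping_subsets(n_items, subset_size, overlap, rng):
    """Randomly split 0..n_items-1 into blocks of roughly `subset_size`,
    then pad each block with `overlap` items drawn from the other blocks."""
    perm = rng.permutation(n_items)
    blocks = [perm[i:i + subset_size] for i in range(0, n_items, subset_size)]
    subsets = []
    for b in blocks:
        others = np.setdiff1d(perm, b)
        extra = rng.choice(others, size=min(overlap, others.size), replace=False)
        subsets.append(np.concatenate([b, extra]))
    return subsets

def coclustering_probabilities(label_samples):
    """Posterior co-clustering probabilities p_ij from an (S, n) array of
    sampled cluster-label vectors."""
    S, n = label_samples.shape
    p = np.zeros((n, n))
    for z in label_samples:
        p += (z[:, None] == z[None, :])
    return p / S

def expected_binder_loss(candidate, p):
    """Posterior expected Binder loss (up to a constant) of a candidate
    label vector, given co-clustering probabilities p."""
    same = candidate[:, None] == candidate[None, :]
    pairwise = np.where(same, 1.0 - p, p)
    iu = np.triu_indices_from(pairwise, k=1)
    return pairwise[iu].sum()

# Toy usage: pick the sampled partition minimizing expected Binder loss.
rng = np.random.default_rng(0)
samples = rng.integers(0, 3, size=(200, 50))   # placeholder posterior label samples
p = coclustering_probabilities(samples)
best = min(samples, key=lambda z: expected_binder_loss(z, p))
subsets = overlapping_subsets(n_items=50, subset_size=20, overlap=5, rng=rng)
```

The same scoring idea applies with other losses (e.g., variation of information); only `expected_binder_loss` would change.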
Keywords: Bayesian clustering, Decision theory, Variation of information loss, Binder loss, Big data
Main Sponsor
Section on Bayesian Statistical Science