Loss-Based Bayesian Clustering for Big Data Using Splinters

Garritt Page, Co-Author
Brigham Young University
 
Fernando Quintana, Co-Author
Pontificia Universidad Católica de Chile
 
David Dahl, First Author
Brigham Young University
 
David Dahl, Presenting Author
Brigham Young University
 
Wednesday, Aug 6: 11:50 AM - 12:05 PM
2055 
Contributed Papers 
Music City Center 
We propose a Bayesian method to cluster large datasets for which obtaining samples from the full posterior distribution is impractical. In Bayesian inference, an estimator is chosen by introducing a loss function and reporting the Bayes rule that minimizes its posterior expectation. Except in trivially small cases, this expectation must be approximated, typically using posterior samples. However, standard MCMC algorithms scale poorly, making it difficult to fit models with tens of thousands of items. We address this "big data" setting, where posterior sampling is infeasible, by splitting the data into overlapping subsets small enough for existing MCMC algorithms. The model is fit to each subset independently, yielding several sets of posterior samples. Our goal is to use these samples to estimate a partition that approximates the one minimizing the full model's posterior expected loss. The subset size, the number of subsets, and the degree of overlap are key tuning parameters, which we explore.
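As a rough illustration of the workflow the abstract describes (not the authors' splinter algorithm itself), the sketch below splits item indices into overlapping subsets that share a set of anchor items, substitutes a placeholder for the per-subset posterior partition samples, pools Monte Carlo co-clustering probabilities, and applies a simple greedy heuristic aimed at minimizing expected Binder loss. The overlap scheme, the placeholder sampler, the greedy minimizer, and all function names are assumptions made for illustration; in practice the per-subset draws come from actual MCMC fits, and reconciling items that never appear in the same subset is precisely the harder problem the proposed method addresses.

```python
import numpy as np


def overlapping_subsets(n, n_subsets, n_shared, rng):
    """Split indices 0..n-1 into subsets that all share `n_shared` anchor items.

    This is one simple overlap scheme, not necessarily the one used in the talk.
    """
    anchors = rng.choice(n, size=n_shared, replace=False)
    rest = np.setdiff1d(np.arange(n), anchors)
    rng.shuffle(rest)
    return [np.concatenate([anchors, chunk])
            for chunk in np.array_split(rest, n_subsets)]


def fake_posterior_partitions(items, n_draws, rng):
    """Placeholder for per-subset MCMC output: an (n_draws, len(items)) array of
    cluster labels. In practice these come from fitting the model to the subset."""
    return rng.integers(0, 3, size=(n_draws, len(items)))


def accumulate_coclustering(pair_sums, pair_counts, items, draws):
    """Add one subset's Monte Carlo co-clustering counts into the global matrices."""
    idx = np.asarray(items)
    block = np.ix_(idx, idx)
    for labels in draws:
        pair_sums[block] += (labels[:, None] == labels[None, :])
        pair_counts[block] += 1.0


def binder_point_estimate(pi_hat):
    """Greedy heuristic toward the partition minimizing expected Binder loss:
    join an item to an existing cluster when its mean estimated co-clustering
    probability with that cluster exceeds 1/2; otherwise start a new cluster."""
    n = pi_hat.shape[0]
    clusters = []
    for i in range(n):
        best_k, best_p = None, 0.5
        for k, members in enumerate(clusters):
            p = pi_hat[i, members].mean()
            if p > best_p:
                best_k, best_p = k, p
        if best_k is None:
            clusters.append([i])
        else:
            clusters[best_k].append(i)
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, n_subsets, n_shared, n_draws = 200, 4, 20, 100
    pair_sums = np.zeros((n, n))
    pair_counts = np.zeros((n, n))
    for items in overlapping_subsets(n, n_subsets, n_shared, rng):
        draws = fake_posterior_partitions(items, n_draws, rng)  # subset MCMC goes here
        accumulate_coclustering(pair_sums, pair_counts, items, draws)
    # Pairs never observed in a common subset default to probability zero here;
    # handling such pairs well is the core difficulty of the big-data setting.
    pi_hat = np.divide(pair_sums, pair_counts,
                       out=np.zeros_like(pair_sums), where=pair_counts > 0)
    estimate = binder_point_estimate(pi_hat)
    print(len(np.unique(estimate)), "clusters estimated")
```

The same skeleton accommodates other loss functions mentioned below, such as variation of information, by swapping the point-estimation step; the subset size, number of subsets, and overlap appear directly as the tuning parameters the abstract highlights.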

Keywords

Bayesian clustering

Decision theory

Variation of information loss

Binder loss

Big data 

Main Sponsor

Section on Bayesian Statistical Science