Sparse Bayesian Clustering for Bounded Data via a Multivariate Beta Mixture Model

Carmen Rodriguez Cabrera, Speaker
Harvard University
 
Briana Stephenson, Co-Author
Harvard T.H. Chan School of Public Health
 
Sunday, Aug 2: 2:00 PM - 3:50 PM
3361 
Contributed Speed 
We develop a Bayesian overfitted multivariate beta mixture model for clustering aggregated ecological data bounded between 0 and 1. Such data are common in social determinants of health (SDoH) research but are poorly served by standard clustering methods, which impose restrictive distributional assumptions and offer limited interpretability. The proposed model reparameterizes the multivariate beta distribution in terms of mean and concentration parameters, enabling direct interpretation of cluster-specific profiles while accommodating the skewness inherent in the data. An integrated feature saliency component operates on the cluster means, inducing sparsity by identifying the variables that meaningfully drive clustering and shrinking uninformative features toward a shared mean. An overfitted mixture formulation supports data-driven inference on the number of clusters while preserving posterior uncertainty. We assess performance through simulation studies and apply the model to neighborhood-level SDoH data from the Agency for Healthcare Research and Quality, yielding interpretable ecological clusters. The framework generalizes to a broad class of bounded, aggregated multivariate data.
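
The mean and concentration reparameterization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the multivariate beta distribution, so the code below assumes the common univariate convention (shape parameters a = mu * phi and b = (1 - mu) * phi) and, purely for illustration, treats the features within a cluster as independent beta margins. The function names are hypothetical and are not taken from the authors' implementation.

```python
import math


def beta_shapes(mu, phi):
    """Convert a mean mu in (0, 1) and concentration phi > 0 to beta shapes.

    Assumed convention (not stated in the abstract): a = mu * phi and
    b = (1 - mu) * phi, so that a / (a + b) = mu and a + b = phi.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if phi <= 0.0:
        raise ValueError("phi must be positive")
    return mu * phi, (1.0 - mu) * phi


def beta_logpdf(x, mu, phi):
    """Log density of a beta distribution in the mean/concentration form."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    a, b = beta_shapes(mu, phi)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def cluster_loglik(x, mu, phi):
    """Log-likelihood of one observation under one cluster.

    Illustrative assumption: features are independent beta margins given
    the cluster, so the joint log density is a sum over features.
    """
    return sum(beta_logpdf(xj, mj, pj) for xj, mj, pj in zip(x, mu, phi))


# Hypothetical cluster profile over three bounded features.
profile_mean = [0.25, 0.60, 0.10]
profile_concentration = [8.0, 20.0, 5.0]
observation = [0.30, 0.55, 0.05]
loglik = cluster_loglik(observation, profile_mean, profile_concentration)
```

Under this convention each cluster is summarized directly by its vector of feature means, which is what makes the cluster profiles readable, while the concentration controls how tightly observations gather around those means.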

Keywords

Bayesian mixture model


multivariate beta distribution

sparse modeling

ecological data

feature saliency 

Main Sponsor

Section on Bayesian Statistical Science