Tuesday, Aug 5: 10:30 AM - 12:20 PM
0343
Invited Paper Session
Music City Center
Room: CC-208B
Bayesian nonparametrics
Applied
No
Main Sponsor
Section on Bayesian Statistical Science
Co Sponsors
IMS
International Society for Bayesian Analysis (ISBA)
Presentations
Collecting genomics data across multiple heterogeneous populations (e.g., across different cancer types) has the potential to improve our understanding of disease. Despite sequencing advances, though, resources often remain a constraint when gathering data. It would therefore be useful for experimental design if experimenters with access to a pilot study could predict the number of new variants they might expect to find in a follow-up study: both the number of new variants shared between the populations and the total across the populations. While many authors have developed prediction methods for the single-population case, we show that these predictions can fare poorly when the multiple populations are heterogeneous. We show that the Bayesian nonparametric (BNP) framework of a state-of-the-art single-population predictor extends naturally to multiple populations. However, we prove that a particularly natural choice of prior within this framework fails for fundamental reasons. By supplying an alternative BNP prior, we provide the first predictor for the number of new shared variants and new total variants that can handle heterogeneity across multiple populations. We show that our proposed method works well empirically on real cancer and population genetics data.
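As a minimal sketch of the single-population setting this work builds on, and not the authors' multi-population predictor, the example below assumes a one-parameter Indian buffet process (the exchangeable feature-allocation marginal of a beta-Bernoulli process): a plug-in estimate of the mass parameter from a pilot study gives a closed-form prediction of the expected number of new variants in a follow-up. The mass parameter and sample sizes are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def harmonic(n_start, n_end):
    """Sum of 1/n for n = n_start, ..., n_end (inclusive)."""
    return (1.0 / np.arange(n_start, n_end + 1)).sum()

def simulate_pilot_variants(alpha, n_pilot):
    """Number of distinct variants seen in a pilot of n_pilot samples:
    under a one-parameter IBP with mass alpha, sample n contributes
    Poisson(alpha / n) brand-new variants, independently across samples."""
    return rng.poisson(alpha / np.arange(1, n_pilot + 1)).sum()

alpha_true, n_pilot, n_follow = 50.0, 100, 400

# "Pilot study": observe only the total number of distinct variants.
k_pilot = simulate_pilot_variants(alpha_true, n_pilot)

# Plug-in estimate of the mass parameter: E[K_pilot] = alpha * sum_{n <= n_pilot} 1/n.
alpha_hat = k_pilot / harmonic(1, n_pilot)

# Predicted number of *new* variants among n_follow additional samples.
predicted_new = alpha_hat * harmonic(n_pilot + 1, n_pilot + n_follow)
true_expected = alpha_true * harmonic(n_pilot + 1, n_pilot + n_follow)

print(f"pilot variants: {k_pilot}, estimated mass: {alpha_hat:.1f}")
print(f"predicted new variants: {predicted_new:.1f} (true expectation: {true_expected:.1f})")
```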
Keywords
Bayesian nonparametrics
genomics data
number of variants
beta process
Poisson point process
In many high-dimensional regression settings, it is appealing to impose low-dimensional structures on the coefficients. Additionally, clustering the coefficients helps uncover latent groups that reflect heterogeneity in the relationship between covariates and outcomes.
Clustering such high-dimensional data with low-dimensional constraints poses computational challenges, especially for optimization methods, because of the nonconvex nature of the mixture problem. While Bayesian methods offer a natural framework for sampling from the mixture model and quantifying uncertainty, specifying the prior remains difficult: spike-and-slab priors introduce computational complexity in sampling, whereas continuous shrinkage priors are ineffective at inducing exact sparsity within mixture models. To address these challenges, we propose an optimization-driven structural sparse prior within a nonparametric Bayesian clustering approach. The hierarchical prior structure enables an efficient and straightforward Gibbs sampler. From a theoretical standpoint, we establish consistency results, both in terms of optimal parameter recovery rates and clustering accuracy. We illustrate the effectiveness of the proposed method on a compositional regression task, analyzing GDP contributions from multiple industries across 51 states.
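As a generic illustration of nonparametric Bayesian clustering by Gibbs sampling, and not the structural sparse prior or sampler of this talk, the sketch below runs a collapsed Gibbs sampler for a Chinese-restaurant-process mixture of isotropic Gaussians on toy coefficient vectors. The conjugate Gaussian model, the hyperparameters `alpha`, `sigma2`, and `tau2`, and the toy data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pred(x, size, total, sigma2, tau2):
    """Log predictive density of x under one cluster of an isotropic Gaussian
    DP mixture: obs. variance sigma2 (known), N(0, tau2 I) prior on cluster
    means; size and total are the cluster's current size and coordinate sums."""
    post_var = sigma2 * tau2 / (sigma2 + size * tau2)
    post_mean = tau2 * total / (sigma2 + size * tau2)
    pred_var = post_var + sigma2
    return -0.5 * np.sum(np.log(2 * np.pi * pred_var) + (x - post_mean) ** 2 / pred_var)

def crp_gibbs(X, alpha=1.0, sigma2=1.0, tau2=4.0, n_iter=50):
    """Collapsed Gibbs sampling of cluster labels for the rows of X under a
    Chinese-restaurant-process (Dirichlet process) mixture."""
    n = X.shape[0]
    z = np.zeros(n, dtype=int)            # start with everyone in one cluster
    sizes, sums = {0: n}, {0: X.sum(axis=0)}
    for _ in range(n_iter):
        for i in range(n):
            k = z[i]                      # remove point i from its cluster
            sizes[k] -= 1
            sums[k] = sums[k] - X[i]
            if sizes[k] == 0:
                del sizes[k], sums[k]
            labels = list(sizes) + [max(sizes, default=-1) + 1]
            logw = [np.log(sizes[c]) + log_pred(X[i], sizes[c], sums[c], sigma2, tau2)
                    for c in labels[:-1]]
            logw.append(np.log(alpha) + log_pred(X[i], 0, 0.0, sigma2, tau2))
            logw = np.array(logw)
            probs = np.exp(logw - logw.max())
            k_new = labels[rng.choice(len(labels), p=probs / probs.sum())]
            z[i] = k_new                  # add point i to the chosen cluster
            sizes[k_new] = sizes.get(k_new, 0) + 1
            sums[k_new] = sums.get(k_new, 0.0) + X[i]
    return z

# Toy "coefficient vectors" drawn from two latent groups.
X = np.vstack([rng.normal(2.0, 1.0, size=(30, 5)), rng.normal(-2.0, 1.0, size=(30, 5))])
print(np.unique(crp_gibbs(X), return_counts=True))
```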
Keywords
Bayesian Nonparametrics
Dimension Reduction
Compositional Regression
Species sampling processes have long provided a fundamental framework for random discrete distributions and exchangeable sequences. However, analyzing data from distinct yet related sources requires a broader notion of probabilistic invariance, with partial exchangeability as the natural choice. Over the past two decades, numerous models for partially exchangeable data, known as dependent nonparametric priors, have emerged, including hierarchical, nested, and additive processes. Despite their widespread use in Statistics and Machine Learning, a unifying framework remains elusive, leaving key questions about their learning mechanisms unanswered.
We fill this gap by introducing multivariate species sampling models, a general class of nonparametric priors encompassing most existing dependent nonparametric processes. These models are defined by a partially exchangeable partition probability function, encoding the induced multivariate clustering structure. We establish their core distributional properties and dependence structure, showing that borrowing of information across groups is entirely determined by shared ties. This provides new insights into their learning mechanisms, including a principled explanation for the correlation structure observed in existing models.
Beyond offering a cohesive theoretical foundation, our approach serves as a constructive tool for developing new models and opens new research directions aimed at capturing even richer dependence structures.
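The small simulation below is not taken from the paper; it illustrates one member of this model class, the hierarchical Dirichlet process, by sampling partitions for two groups through the Chinese restaurant franchise and counting the dishes (shared atoms) that tie the groups together. The concentration parameters `alpha` and `gamma` and the group sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def chinese_restaurant_franchise(n_per_group, n_groups=2, alpha=3.0, gamma=3.0):
    """Sample component (dish) labels from a hierarchical Dirichlet process via
    the Chinese restaurant franchise. Each group seats customers at tables with
    a CRP(alpha); each new table orders a dish from a shared CRP(gamma), so the
    dishes act as atoms that can be tied across groups."""
    dish_tables = []                 # franchise level: number of tables serving each dish
    group_dishes = []
    for _ in range(n_groups):
        table_counts, table_dish, dishes = [], [], []
        for _ in range(n_per_group):
            # Existing table with prob. proportional to its size, new table to alpha.
            w = np.array(table_counts + [alpha], dtype=float)
            t = rng.choice(len(w), p=w / w.sum())
            if t == len(table_counts):                    # new table: pick its dish
                dw = np.array(dish_tables + [gamma], dtype=float)
                d = rng.choice(len(dw), p=dw / dw.sum())
                if d == len(dish_tables):                 # brand-new dish
                    dish_tables.append(0)
                dish_tables[d] += 1
                table_counts.append(0)
                table_dish.append(d)
            table_counts[t] += 1
            dishes.append(table_dish[t])
        group_dishes.append(set(dishes))
    return group_dishes

g1, g2 = chinese_restaurant_franchise(n_per_group=200)
print(f"components in group 1: {len(g1)}, in group 2: {len(g2)}, shared: {len(g1 & g2)}")
```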
Keywords
Bayesian Nonparametrics
Dependent nonparametric prior
Dirichlet process
Partial exchangeability
Pitman-Yor process
Random partition
Consider a multi-population inference problem where it is of interest to estimate the mean of the population with the highest observed sample average. The usual confidence interval does not work in this case: its coverage falls increasingly below the nominal level as the total number of populations grows. This phenomenon is often referred to as the Winner's Curse. Various modifications have been proposed to adjust for the selection step. We show that interval procedures that guarantee nominal coverage conditional on the selection event typically have infinite expected length. This result motivates us to consider empirical Bayesian solutions, which offer coverage guarantees only on average over some parameter subspace. Nonparametric empirical Bayesian solutions are shown to generally offer good coverage with high precision but can perform poorly when one population is very different from all the others, a clear violation of the underlying exchangeability assumption. We conclude with further mitigation strategies and discuss their frequentist and Bayesian interpretations.
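A quick simulation, not from the talk, illustrates the coverage drop: assuming K normal populations with a common mean of zero and known unit variance (choices made purely for illustration), the naive 95% interval around the winning sample mean covers its target less and less often as K grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def naive_coverage(n_pops, n_obs=50, n_reps=5000, z=1.96):
    """Monte Carlo coverage of the naive z-interval for the mean of the
    population selected because it has the largest sample average.
    All true means are 0 here, so the target of inference is always 0."""
    covered = 0
    for _ in range(n_reps):
        samples = rng.normal(0.0, 1.0, size=(n_pops, n_obs))
        means = samples.mean(axis=1)
        winner_mean = means[means.argmax()]
        half_width = z / np.sqrt(n_obs)        # known unit variance
        if abs(winner_mean - 0.0) <= half_width:
            covered += 1
    return covered / n_reps

for k in (1, 2, 10, 50, 200):
    print(f"K = {k:3d}: naive 95% coverage (estimated) {naive_coverage(k):.3f}")
```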
Keywords
Selective inference
Winner's curse
Infinite length confidence intervals
Hierarchical Bayes
Empirical Bayes
Predictive recursion