SPEED 4: Bayesian Analysis and Computational Methods, Part 1

Chair: Elizabeth Graff, Department of Biostatistics, Harvard University
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4060 
Contributed Speed 
Music City Center 
Room: CC-104B 

Presentations

A general framework for probabilistic model uncertainty

Existing approaches to model uncertainty typically either compare models using a quantitative model selection criterion or evaluate posterior model probabilities having set a prior. In this paper, we propose an alternative strategy which views missing observations as the source of model uncertainty, where the true model would be identified with the complete data. To quantify model uncertainty, it is then necessary to provide a probability distribution for the missing observations conditional on what has been observed. This can be set sequentially using one-step-ahead predictive densities, which recursively sample from the best model according to some consistent model selection criterion. Repeated predictive sampling of the missing data, to give a complete dataset and hence a best model each time, provides our measure of model uncertainty. This approach bypasses the need for subjective prior specification or integration over parameter spaces, addressing issues with standard methods such as the Bayes factor. We provide illustrations from hypothesis testing, density estimation, and variable selection, demonstrating our approach on a range of standard problems. 
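
As a concrete illustration of the predictive-resampling idea, the following Python sketch completes an observed sample by one-step-ahead draws from whichever of two simple Gaussian models (zero mean vs. free mean) currently minimizes BIC, and tallies how often each model is selected on the completed data. The two candidate models, the BIC criterion, the plug-in predictive, and all sample sizes are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def bic_best_model(x):
    """Return 0 for N(0, s2) or 1 for N(mu, s2), whichever has the lower BIC."""
    n = x.size
    s2_0 = np.mean(x**2)                      # MLE of the variance when mu = 0
    mu_1 = x.mean()
    s2_1 = np.mean((x - mu_1)**2)             # MLE of the variance with free mean
    ll0 = -0.5 * n * (np.log(2 * np.pi * s2_0) + 1)
    ll1 = -0.5 * n * (np.log(2 * np.pi * s2_1) + 1)
    bic0 = -2 * ll0 + 1 * np.log(n)           # one free parameter (s2)
    bic1 = -2 * ll1 + 2 * np.log(n)           # two free parameters (mu, s2)
    return 0 if bic0 <= bic1 else 1

def predictive_resample(x_obs, n_complete, n_rep=500):
    """Complete the data by one-step-ahead draws from the currently selected
    model, then record which model is best on each completed dataset."""
    counts = np.zeros(2)
    for _ in range(n_rep):
        x = list(x_obs)
        while len(x) < n_complete:
            xa = np.asarray(x)
            k = bic_best_model(xa)
            mu = 0.0 if k == 0 else xa.mean()
            sd = np.sqrt(np.mean((xa - mu)**2))
            x.append(rng.normal(mu, sd))      # one-step-ahead predictive draw
        counts[bic_best_model(np.asarray(x))] += 1
    return counts / n_rep                     # model-uncertainty measure

x_obs = rng.normal(0.3, 1.0, size=30)
print(predictive_resample(x_obs, n_complete=200))
```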

Keywords

predictive inference

model uncertainty

hypothesis testing 

Co-Author(s)

Chris Holmes, University of Oxford
Stephen Walker

First Author

Vik Shirvaikar, University of Oxford

Presenting Author

Vik Shirvaikar, University of Oxford

A Scalable Neural Network-Based Approach for Variable Selection in Nonparametric Regression

Variable selection is crucial in statistical modeling, especially in high-dimensional contexts where it improves interpretability and accuracy. We propose a neural network-based approach for variable selection in nonparametric regression models, incorporating L1 penalties and custom loss functions to encourage sparsity while maintaining deep learning flexibility. Our hybrid framework uses neural networks for feature selection, efficiently managing many variables. Comparisons with Bayesian Kernel Machine Regression (BKMR) show our method handles more variables, overcoming BKMR's computational limits. Simulation studies demonstrate our method effectively selects important variables and mitigates overfitting. This approach offers a scalable solution for high-dimensional modeling, with advantages over traditional methods in complex data structures, such as in environmental science research. 
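
A minimal sketch of the kind of L1-penalized network the abstract describes: a single-hidden-layer regression network trained by plain gradient descent in NumPy, with an L1 penalty on the input-layer weights, after which variables are ranked by the weight mass they receive. The architecture, penalty placement, toy data, and tuning constants are all assumptions made for illustration; the authors' hybrid framework and custom loss functions are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 observations, 50 candidate variables, only the first two matter.
n, p, h = 200, 50, 16
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]**2 + 0.1 * rng.normal(size=n)

W1 = 0.1 * rng.normal(size=(h, p)); b1 = np.zeros(h)   # input layer (penalized)
W2 = 0.1 * rng.normal(size=h);      b2 = 0.0           # output layer
lam, lr = 0.01, 0.05

for step in range(3000):
    H = np.tanh(X @ W1.T + b1)
    yhat = H @ W2 + b2
    e = (yhat - y) / n                        # gradient of 0.5 * mean squared error
    dW2 = H.T @ e
    db2 = e.sum()
    dZ = np.outer(e, W2) * (1 - H**2)         # backprop through tanh
    dW1 = dZ.T @ X + lam * np.sign(W1)        # L1 subgradient on input weights
    db1 = dZ.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

importance = np.abs(W1).sum(axis=0)           # per-variable input weight mass
print(np.argsort(importance)[::-1][:5])       # top-ranked variables
```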

Keywords

Variable Selection

Neural Network

Nonparametric

Environmental Science 

Co-Author(s)

Ling Zhou
Peter Song, University of Michigan

First Author

Jiuchen Zhang, University of California, Irvine

Presenting Author

Jiuchen Zhang, University of California, Irvine

Bayesian Deep Gaussian Processes for Correlated Functional Data: A Case Study in Power Spectra

Understanding the structure of our universe and distribution of matter is an area of active research. As cosmological surveys grow, developing emulators to efficiently predict matter power spectra is essential. We are motivated by the Mira-Titan Universe simulation suite which, for a specified cosmological parameterization (i.e., a cosmology), provides multiple response curves of various fidelities, including correlated functional realizations. First, we estimate the underlying true matter power spectra, with appropriate uncertainty quantification (UQ), from all provided curves. We propose a Bayesian deep Gaussian process (DGP) hierarchical model which synthesizes all information and estimates the underlying matter power spectra, while providing effective UQ. Our model extends previous work on Bayesian DGPs from scalar responses to functional ones. Second, we leverage predicted power spectra from various cosmologies to accurately predict the matter power spectra for an unobserved cosmology. We leverage basis functions of the functional spectra to train a separate Gaussian process emulator. Our method performs well in synthetic exercises and against the benchmark emulator. 
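
The second stage, emulating spectra at a new cosmology from basis coefficients, can be sketched with a plain Gaussian process regression on functional principal component scores. Everything below (the toy spectra standing in for Mira-Titan output, the squared-exponential kernel, the number of basis functions) is an assumption for illustration; the hierarchical Bayesian DGP of the first stage is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

def sq_exp_kernel(A, B, ls=0.3, var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls**2)

# Toy training set: m "cosmologies" (d parameters), each with a smooth spectrum
# on a common wavenumber grid, standing in for smoothed simulation output.
m, d, g = 30, 3, 100
theta = rng.uniform(size=(m, d))
k = np.linspace(0.01, 1.0, g)
spectra = np.array([np.exp(-k * (1 + t.sum())) for t in theta])

# Basis functions via SVD (functional principal components); keep q of them.
mean_f = spectra.mean(axis=0)
U, S, Vt = np.linalg.svd(spectra - mean_f, full_matrices=False)
q = 5
coefs = U[:, :q] * S[:q]                      # m x q basis coefficients
basis = Vt[:q]                                # q x g basis functions

def gp_predict(Xtr, ytr, Xte, jitter=1e-6):
    Ktr = sq_exp_kernel(Xtr, Xtr) + jitter * np.eye(len(Xtr))
    return sq_exp_kernel(Xte, Xtr) @ np.linalg.solve(Ktr, ytr)

# Emulate the spectrum at an unobserved cosmology: predict each basis
# coefficient with its own GP, then recombine with the basis functions.
theta_new = rng.uniform(size=(1, d))
coef_new = np.column_stack([gp_predict(theta, coefs[:, j], theta_new)
                            for j in range(q)])
spectrum_new = mean_f + coef_new @ basis
print(spectrum_new.shape)
```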

Keywords

Deep Gaussian Processes

Hierarchical Modeling

Cosmology

Markov Chain Monte Carlo

Surrogate

Uncertainty Quantification 

Co-Author(s)

Annie Booth, Virginia Tech
David Higdon, Virginia Tech
Marco Ferreira, Virginia Tech

First Author

Stephen Walsh, Elms College

Presenting Author

Stephen Walsh, Elms College

Bayesian Estimation of Population Average Causal Effects from a Collection of Trials

We propose two Bayesian mixed effects models, one linear and one linear spline, to estimate the average effect of a binary treatment on a target population via one-stage meta-analysis. In an extension of previous work in a frequentist setting, we aim to combine information from a collection of randomized trials to identify the average treatment effect (ATE) on a separate, non-study target population, by allowing study-level random effects to account for variations in outcome due to differences in studies. We examine, with simulation studies, several situations in which weight-based estimators and/or nonparametric machine learning methods face challenges in estimating a population ATE, and highlight the advantages of our parametric, outcome-based estimators. 
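
As a stripped-down analogue of the outcome-model idea, here is a Gibbs sampler for a normal-normal random-effects meta-analysis of study-level treatment-effect estimates. The model, priors, and simulated trial summaries are illustrative assumptions; the abstract's linear and linear-spline mixed models and the transport to a non-study target population are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated study-level summaries: J trials, each reporting an effect estimate
# with a known standard error.
J = 8
theta_true = rng.normal(0.5, 0.3, size=J)
se = rng.uniform(0.1, 0.3, size=J)
y = rng.normal(theta_true, se)

n_iter, burn = 5000, 1000
mu, tau2 = 0.0, 1.0
a0, b0 = 1.0, 1.0                             # inverse-gamma prior on tau^2
mu_draws = []

for it in range(n_iter):
    # Study-specific effects: conjugate normal full conditionals.
    prec = 1.0 / se**2 + 1.0 / tau2
    theta = rng.normal((y / se**2 + mu / tau2) / prec, np.sqrt(1.0 / prec))

    # Average effect across the collection of trials (flat prior).
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))

    # Between-study variance: inverse-gamma full conditional.
    tau2 = 1.0 / rng.gamma(a0 + J / 2.0,
                           1.0 / (b0 + 0.5 * np.sum((theta - mu)**2)))
    if it >= burn:
        mu_draws.append(mu)

print("posterior mean of the average effect:", np.mean(mu_draws))
```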

Keywords

meta analysis

generalizability

mixed effects models 

Co-Author(s)

Christopher Hans, The Ohio State University
Eloise Kaizar, The Ohio State University

First Author

Patrick McHugh

Presenting Author

Patrick McHugh

Bayesian latent class analysis with sparse finite mixtures

We propose the use of overfit mixture models to select the number of components for categorical data with multiple items in the response variable. Latent class analysis (LCA) requires the user to specify the number of classes in the population. Since this is often unknown to the researcher, many studies have investigated the performance of selection criteria, including hypothesis tests, likelihood-based information criteria, and cluster-based information criteria. Alternatively, for Gaussian mixtures, sparse finite mixtures allow a user to fit a model with a large number of components and learn the number of active components a posteriori. We adapt sparse finite mixtures to select the number of components for the LCA model. We provide careful recommendations for priors on the item response probabilities and component means to produce model selection for varying dimensions of the response variable. 
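
A minimal Gibbs sampler for an overfitted latent class model with binary items shows the sparse-finite-mixture mechanism: fit more classes than needed under a Dirichlet weight prior with a very small concentration parameter and read off the number of occupied classes a posteriori. The item count, class structure, Beta priors, and concentration value are assumptions for illustration, not the authors' recommended settings.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated binary-item data: n subjects, J items, 2 true classes.
n, J, K_max, e0 = 300, 8, 10, 0.01            # e0: sparse Dirichlet concentration
true_theta = np.vstack([np.full(J, 0.8), np.full(J, 0.2)])
z_true = rng.integers(0, 2, size=n)
Y = (rng.uniform(size=(n, J)) < true_theta[z_true]).astype(int)

pi = np.full(K_max, 1.0 / K_max)
theta = rng.uniform(0.2, 0.8, size=(K_max, J))
active = []

for it in range(2000):
    # 1. Class labels given weights and item-response probabilities.
    logp = np.log(pi) + Y @ np.log(theta.T) + (1 - Y) @ np.log(1 - theta.T)
    logp -= logp.max(axis=1, keepdims=True)
    prob = np.exp(logp); prob /= prob.sum(axis=1, keepdims=True)
    z = (prob.cumsum(axis=1) > rng.uniform(size=(n, 1))).argmax(axis=1)

    # 2. Mixture weights from the sparse Dirichlet posterior.
    nk = np.bincount(z, minlength=K_max)
    pi = rng.dirichlet(e0 + nk)

    # 3. Item-response probabilities from Beta full conditionals.
    for k in range(K_max):
        s = Y[z == k].sum(axis=0)
        theta[k] = rng.beta(1 + s, 1 + nk[k] - s)

    if it >= 1000:
        active.append((nk > 0).sum())          # number of occupied classes

print("posterior mode of the number of active classes:",
      np.bincount(active).argmax())
```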

Keywords

Latent class analysis

Bayesian mixture model

Cluster analysis 

Co-Author

Erica Porter, Clemson University

First Author

Deborah Kunkel, Clemson University

Presenting Author

Erica Porter, Clemson University

Bayesian Magnitude and Phase Estimation of k-Space Data Enhances Reconstructed Image Quality

It is well known in fMRI studies that the first three images in a time series have a much higher signal than the remainder of the time series. Many attempts to decrease noise and improve activation detection are applied to image data, where voxels may be spatially correlated. This requires spatial modeling and often results in blurrier images. A Bayesian approach is instead employed on the uncorrelated k-space magnitude and phase data, since the spatial frequency coefficients can be treated independently of one another. This method quantifies available a priori information about the spatial frequency coefficients from the first three images in the time series. The spatial frequencies observed throughout the fMRI experiment are then incorporated, and the spatial frequency coefficients are estimated a posteriori using both the ICM algorithm and Gibbs sampling. Images reconstructed via the inverse discrete Fourier transform from the posterior-estimated spatial frequencies will have reduced noise and increased detection power. 
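
The conjugate flavor of the updating can be sketched coefficient-by-coefficient: a normal prior built from the first images is combined with the remaining scans, and the posterior means are inverse-Fourier-transformed back to image space. This simplification works with complex coefficients rather than magnitude and phase, and omits the ICM and Gibbs machinery; dimensions and noise levels are invented.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented dimensions: a 64 x 64 k-space grid, 3 prior images, 100 later scans.
nx, ny, n_prior, n_scan = 64, 64, 3, 100
truth = rng.normal(size=(nx, ny)) + 1j * rng.normal(size=(nx, ny))
noise = 0.5

prior_imgs = truth + noise * (rng.normal(size=(n_prior, nx, ny))
                              + 1j * rng.normal(size=(n_prior, nx, ny)))
scans = truth + noise * (rng.normal(size=(n_scan, nx, ny))
                         + 1j * rng.normal(size=(n_scan, nx, ny)))

# Conjugate update per spatial-frequency coefficient: combine a normal prior
# built from the first images with the remaining scans.
m0 = prior_imgs.mean(axis=0)                   # prior mean from first images
v0 = prior_imgs.var(axis=0) / n_prior + 1e-6   # prior variance of that mean
v = 2 * noise**2                               # per-coefficient noise variance
post_prec = 1.0 / v0 + n_scan / v
post_mean = (m0 / v0 + scans.sum(axis=0) / v) / post_prec

image = np.fft.ifft2(post_mean)                # reconstruct via inverse DFT
print(np.abs(image).mean())
```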

Keywords

Bayesian

fMRI

k-space 

Co-Author

Dan Rowe, Marquette University

First Author

John Bodenschatz

Presenting Author

John Bodenschatz

Bayesian Semiparametric Estimation of Optimal Treatment Regimes with Irregularly Observed Data

In statistical precision medicine, optimal dynamic treatment regimes (DTRs) are sequences of decision rules assigning treatment tailored to patients' observed history, maximizing a long-term patient outcome referred to as the reward. While Bayesian estimation of the reward under different treatment regimes often requires explicit specification of all model components, non-tailoring confounders can instead be accounted for using inverse weighting, by maximizing a utility defined as the log-likelihood marginalized over these confounders. Existing DTR estimation methods focus on correcting for confounding in observed longitudinal data. However, such data can also be irregular, with visit times driven by patient history; failure to account for this irregularity can induce bias in reward and optimal DTR estimates. In this work, we extend existing weighting approaches for DTR estimation within the Bayesian paradigm to irregularly observed data. We showcase through simulation studies that rewards and optimal regimes can be estimated without bias by using a double weighting approach, with inverse weighting terms that control for both confounding and visit irregularity. 
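
The double-weighting idea can be illustrated in a single-stage toy example: one set of weights from a treatment (propensity) model and one from a visit (observation-intensity) model, multiplied together before computing a weighted treatment-effect contrast among observed visits. The data-generating mechanism, logistic models, and use of scikit-learn are assumptions; the Bayesian semiparametric estimation and sequential-regime structure are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Single-stage toy data: a confounder X drives treatment A, the outcome Y,
# and whether the visit (outcome measurement) is observed at all.
n = 5000
X = rng.normal(size=(n, 1))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
visit = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.8 * X[:, 0]))))
Y = 1.0 * A + 0.5 * X[:, 0] + rng.normal(size=n)

# Treatment weights: inverse probability of the treatment actually received.
ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
w_trt = np.where(A == 1, 1 / ps, 1 / (1 - ps))

# Visit weights: inverse probability of being observed at this time.
pv = LogisticRegression().fit(X, visit).predict_proba(X)[:, 1]
w_vis = 1 / pv

# Double weighting: the product controls both confounding and visit irregularity.
w = w_trt * w_vis
obs = visit == 1
effect = (np.average(Y[obs & (A == 1)], weights=w[obs & (A == 1)])
          - np.average(Y[obs & (A == 0)], weights=w[obs & (A == 0)]))
print(round(effect, 3))                        # should be near the true value 1.0
```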

Keywords

Dynamic treatment regimes

irregularly observed data

inverse weighting

Bayesian semiparametric estimation

statistical precision medicine 

Co-Author(s)

Eleanor Pullenayegum, Hospital for Sick Children
Olli Saarela, University of Toronto

First Author

Larry Dong

Presenting Author

Larry Dong

Bayesian Small Area Estimation of Proportions Using Archimedean Copula Models with Random Effects

This study addresses the challenge of estimating proportions for small areas, particularly when sample sizes are small or even zero, leading to unreliable direct estimates. We propose a Bayesian approach using a new class of probability models that incorporates random effects to account for variation between small areas. These models are based on exchangeable Archimedean copulas, allowing flexible modeling of extra-binomial variation. The Bayesian inference involves obtaining the posterior distribution of the random effect and its Laplace transform, which is then used to derive Bayes estimates of small area proportions. Model parameters are estimated using maximum likelihood, and the Akaike information criterion (AIC) is used for model selection. We also develop empirical best predictors (EBP) and empirical best linear unbiased predictors (EBLUP) for small area proportions and propose a jackknife method to estimate the prediction mean squared error (PMSE) of these predictors. The methods are illustrated through simulated and real data examples, demonstrating their effectiveness in providing reliable small-area estimates. This work contributes to the growing literature on small-area estimation. 

Keywords

Small Area Estimation

Bayesian Inference

Archimedean Copulas

Random Effects

Empirical Best Predictor (EBP)

Maximum Likelihood Estimation

Jackknife Method 

First Author

Fode Tounkara

Presenting Author

Fode Tounkara

WITHDRAWN Eliciting some generalisations of Dirichlet distribution as priors for Multinomial models

The Dirichlet distribution, central to probability theory and statistical modeling, is a cornerstone for multinomial sampling models. Its ability to simulate proportions across categories makes it indispensable in fields like biology and communication theory. However, as our understanding of complex systems advances, there is a need to extend and generalise models to better capture real-world intricacies. Hence, researchers have introduced some generalisations of the Standard Dirichlet such as the Dirichlet Type 3, Scaled Dirichlet, Shifted-scaled Dirichlet, Flexible Dirichlet and Extended Flexible Dirichlet distributions. These models provide greater flexibility in capturing dependencies and expert beliefs in elicitation contexts. While eliciting the standard Dirichlet distribution is well documented, methods for eliciting these more flexible models are relatively new. To support this, we have developed methods and a Shiny R application that enables experts to input and visualise their beliefs using these generalisations. This tool offers a practical approach to applying these advanced models, that supports the elicitation process and broadens their use in applied research. 
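
One of the generalisations mentioned, the scaled Dirichlet, can be simulated through its gamma construction: normalizing independent gamma variables with a common rate gives the standard Dirichlet, while allowing component-specific rates gives the scaled Dirichlet. The parameter values below are arbitrary, and the elicitation interface itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(7)

def standard_dirichlet(alpha, size):
    g = rng.gamma(alpha, 1.0, size=(size, len(alpha)))
    return g / g.sum(axis=1, keepdims=True)

def scaled_dirichlet(alpha, beta, size):
    # Normalizing independent Gamma(alpha_i, rate beta_i) draws yields the
    # scaled Dirichlet; unequal rates add flexibility beyond the standard case.
    g = rng.gamma(alpha, 1.0 / beta, size=(size, len(alpha)))
    return g / g.sum(axis=1, keepdims=True)

alpha = np.array([2.0, 3.0, 4.0])              # arbitrary shape parameters
beta = np.array([1.0, 0.5, 2.0])               # arbitrary rate parameters
print(standard_dirichlet(alpha, 100000).mean(axis=0))
print(scaled_dirichlet(alpha, beta, 100000).mean(axis=0))
```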

Keywords

Bayesian Statistics

Prior Elicitation

Expert Elicitation

Dirichlet Distribution

Shiny R 

Co-Author

Fadlalla Elfadaly, The Open University, Walton Hall, Milton Keynes, UK

First Author

Nayana Unnipillai

From Naive Bayes to Large Language Models: A New Era of Text-Based Classification

Text classification has been central to data science, with models like Naive Bayes serving as foundational tools. While effective for short, structured data, traditional classifiers struggled with long-form text due to methods like TF-IDF, which failed to capture context and long-term dependencies. Tasks like sentiment analysis of reports or categorizing legal documents often led to inefficiencies. Large language models (LLMs) have transformed this landscape. Transformer architectures excel at understanding context within text, enabling accurate classification for complex tasks such as identifying topics in policy documents or classifying unstructured medical records. Further, smaller, fine-tuned LLMs like BERT provide scalable, cost-effective solutions without sacrificing performance compared to larger general-purpose LLMs like GPT-4o which involve higher costs for both training and inference. This talk focuses on how such smaller LLMs can be utilized to solve classification problems at a low cost with high accuracy. It'll highlight strategies for deploying such models efficiently so that they empower data scientists and statisticians to tackle challenges in modern text analytics. 
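
For contrast with the LLM approaches discussed, here is the classical baseline in a few lines of scikit-learn: TF-IDF features feeding a multinomial Naive Bayes classifier. The toy documents and labels are invented; fine-tuning a model such as BERT would replace this pipeline with a tokenizer and transformer classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy documents and labels.
texts = ["contract renewal terms and liability clauses",
         "patient presented with chest pain and shortness of breath",
         "amendment to the lease agreement effective next quarter",
         "follow-up MRI shows no abnormal findings"]
labels = ["legal", "medical", "legal", "medical"]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["arbitration clause in the signed agreement"]))
```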

Keywords

Text classification

Large language models (LLMs)

Transformer architectures

BERT

Cost-effective solutions

Modern text analytics 

Co-Author

Rajat Verma

First Author

Abdul Wasay, Autodesk

Presenting Author

Abdul Wasay, Autodesk

Generalized Bayesian Inference for Dynamic Random Dot Product Graphs

The random dot product graph (RDPG) is a popular model for network data with extensions that accommodate dynamic (time-varying) networks. However, two significant deficiencies exist in the dynamic RDPG literature: (1) no coherent Bayesian way to update one's prior beliefs about the model parameters due to their complicated constraints, and (2) no approach to forecast future networks with meaningful uncertainty quantification. This work proposes a generalized Bayesian framework that addresses these needs using a Gibbs posterior that represents a coherent updating of Bayesian beliefs based on a least-squares loss function. Furthermore, we establish the consistency and contraction rate of this Gibbs posterior under commonly adopted Gaussian random walk priors. For estimation, we develop a fast Gibbs sampler with a time complexity that is linear in both the number of time points and observed edges in the dynamic network. Simulations and real data analyses show that the proposed method's in-sample and forecasting performance outperforms that of competitors. 
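
The Gibbs-posterior mechanism itself, updating beliefs through a loss function rather than a likelihood, can be shown on a toy location problem with a least-squares loss and a random-walk Metropolis sampler. The loss weight, prior, and proposal scale are arbitrary; the dynamic RDPG structure, contraction theory, and linear-time Gibbs sampler of the talk are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy location problem: update beliefs through a least-squares loss rather
# than a likelihood, sampled with random-walk Metropolis.
y = rng.normal(1.5, 1.0, size=50)
w = 1.0                                        # loss (tempering) weight

def log_gibbs_post(theta):
    loss = np.sum((y - theta)**2)              # least-squares loss
    log_prior = -0.5 * theta**2                # standard normal prior
    return -w * loss + log_prior

theta, cur = 0.0, log_gibbs_post(0.0)
draws = []
for it in range(20000):
    prop = theta + 0.2 * rng.normal()
    new = log_gibbs_post(prop)
    if np.log(rng.uniform()) < new - cur:      # Metropolis acceptance step
        theta, cur = prop, new
    if it >= 5000:
        draws.append(theta)

print("Gibbs-posterior mean:", np.mean(draws))
```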

Keywords

Gibbs posterior

Dynamic network data

Network latent space models

statistical network analysis

Bayesian inference

Forecasting 

First Author

Joshua Loyal, Florida State University

Presenting Author

Joshua Loyal, Florida State University

Marginal Likelihood Estimation in Bayesian Item Response Theory Models

Item Response Theory (IRT) models are widely used in psychometrics to measure latent traits like ability from test responses. Standard IRT models assume a fixed trait distribution, which may not capture population differences. To address this, Bayesian nonparametric (BNP) IRT models use priors such as the Chinese Restaurant Process (CRP) to allow data-driven clustering of individuals. While this increases flexibility, it also adds computational complexity, making accurate marginal likelihood estimation crucial for comparing BNP and parametric models using Bayes factors, especially in high-dimensional settings. Bridge sampling provides a more stable alternative to traditional Monte Carlo methods but must be adapted to handle the discrete clustering structure of BNP models.

This work develops a two-step method for marginal likelihood estimation in BNP IRT models. First, latent traits are integrated out using the model's structure, reducing computation. Second, bridge sampling is refined, incorporating moment-matching and variance reduction techniques to improve accuracy. Simulation results show that this method enhances efficiency and precision. 
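
The bridge-sampling refinement builds on the standard iterative estimator, which can be checked on a conjugate toy model where the marginal likelihood is available in closed form. The sketch below uses a moment-matched normal proposal and Meng and Wong's fixed-point iteration; the model, sample sizes, and proposal family are assumptions, and the latent-trait integration step of the two-step method is not shown.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(9)

# Conjugate toy model: y_i ~ N(theta, 1), theta ~ N(0, 1), so the true
# marginal likelihood is available in closed form for checking.
n = 20
y = rng.normal(0.5, 1.0, size=n)
post_var = 1.0 / (n + 1.0)
post_mean = y.sum() * post_var

def log_q(theta):
    """Unnormalized posterior: log likelihood plus log prior."""
    return norm.logpdf(y[:, None], theta, 1.0).sum(axis=0) + norm.logpdf(theta, 0, 1)

# Posterior draws (exact here; MCMC in general) and a moment-matched proposal g.
n1 = n2 = 4000
th_post = rng.normal(post_mean, np.sqrt(post_var), size=n1)
g_mean, g_sd = th_post.mean(), th_post.std()
th_prop = rng.normal(g_mean, g_sd, size=n2)
log_g = lambda t: norm.logpdf(t, g_mean, g_sd)

l1 = np.exp(log_q(th_post) - log_g(th_post))   # ratios at posterior draws
l2 = np.exp(log_q(th_prop) - log_g(th_prop))   # ratios at proposal draws
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

r = 1.0                                        # Meng & Wong fixed-point iteration
for _ in range(100):
    r = np.mean(l2 / (s1 * l2 + s2 * r)) / np.mean(1.0 / (s1 * l1 + s2 * r))

true_logml = multivariate_normal.logpdf(y, mean=np.zeros(n),
                                        cov=np.eye(n) + np.ones((n, n)))
print(np.log(r), true_logml)                   # bridge estimate vs. closed form
```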

Keywords

Bridge Sampling

Bayes Factor

Hierarchical Models

Latent Variables

Monte Carlo Methods

Nonparametric Clustering 

Co-Author

Sally Paganin, The Ohio State University

First Author

Alex Nguyen

Presenting Author

Alex Nguyen

Missing data imputation via truncated Gaussian factor analysis with application to metabolomics data

In metabolomics, which involves the study of small molecules in biological samples, data are often acquired via mass spectrometry, resulting in high-dimensional, highly correlated datasets with frequent missing values. Both missing at random (MAR), due to acquisition or processing errors, and missing not at random (MNAR), often caused by values falling below detection thresholds, are common. Imputation is thus a critical component of downstream analysis. We propose a novel Truncated Gaussian Infinite Factor Analysis (TGIFA) model to address these challenges. By incorporating truncated Gaussian assumptions, TGIFA respects the physical constraints of the data, while the use of an infinite latent factor framework eliminates the need to pre-specify the number of factors. Our Bayesian inference approach jointly models MAR and MNAR mechanisms and, via a computationally efficient exchange algorithm, provides posterior uncertainty quantification for both imputed values and missingness types. We evaluate TGIFA through extensive simulation studies and apply it to a urinary metabolomics dataset, where it yields sensible and interpretable imputations with associated uncertainty estimates. 
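
The below-detection-limit (MNAR) piece can be illustrated for a single metabolite: censored values are imputed from a normal distribution truncated above at the limit of detection. The moments are crudely estimated from the observed values, standing in for TGIFA's joint truncated-factor model and exchange-algorithm posterior; all quantities are invented.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(10)

# Invented single-metabolite example: log-abundances with a limit of detection
# (LOD); values below the LOD are missing not at random.
n, mu, sigma, lod = 500, 2.0, 1.0, 1.2
x = rng.normal(mu, sigma, size=n)
observed = x >= lod

# Impute each censored value from a normal truncated above at the LOD, with
# moments crudely estimated from the observed values.
mu_hat, sd_hat = x[observed].mean(), x[observed].std()
b = (lod - mu_hat) / sd_hat                    # standardized upper bound
imputed = truncnorm.rvs(-np.inf, b, loc=mu_hat, scale=sd_hat,
                        size=(~observed).sum(), random_state=11)

completed = x.copy()
completed[~observed] = imputed
print(completed.mean(), x.mean())              # imputed-data mean vs. full data
```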

Keywords

Missing data

Metabolomics

Imputation

Infinite factor model

Mass spectrometry data 

Co-Author(s)

Lorraine Brennan, University College Dublin
Roberta De Vito, Brown University
Massimiliano Russo, The Ohio State University
Isobel Claire Gormley, University College Dublin

First Author

Kate Finucane, University College Dublin

Presenting Author

Kate Finucane, University College Dublin

Modeling Binary Data with Time-Dependent Covariates: A Two-Stage Logistic Regression Approach

We propose a two-stage logistic regression framework tailored for analyzing longitudinal binary data with time-dependent covariates. The model incorporates Bayesian priors and random effects to address feedback loops, correlations from repeated measurements, and the complexities of evolving covariates in hierarchical contexts. By partitioning covariates into time-dependent and time-independent components, the framework effectively handles unequally spaced observations and missing-at-random data. The generalized method of moments (GMM) is used to identify valid instruments by distinguishing between valid and invalid moment conditions. Parameter estimation is conducted via Markov chain Monte Carlo (MCMC) techniques, ensuring consistent and asymptotically normal estimates. The approach is validated through simulation studies and applied to medical data, highlighting its utility in capturing dynamic predictor-outcome relationships. This model is relevant for fields such as medical research, public health, and the behavioral sciences, where dynamic processes play a critical role. The proposed framework can manage highly correlated data and reduces the biases typically seen in traditional methods. 

Keywords

Longitudinal Binary Data

Two-stage Logistic Regression

Time-Dependent Covariates

Bayesian Priors

Random Effects Models

Hierarchical Models 

Co-Author(s)

Ruoqian Liu, Arizona State University
Jeffrey Wilson, Arizona State University

First Author

Lori Selby, Arizona State University

Presenting Author

Lori Selby, Arizona State University

Praxis-BGM: Prior-Augmented and Regularized Natural-Gradient Variational Inference with Gaussian Mixture Models for Transfer Learning

Recent advances in high-throughput omics technologies have enabled observational studies to collect multiple omics layers on the same individuals. However, high-dimensional, low-sample-size (HDLSS) data pose significant challenges for model-based clustering approaches such as Gaussian mixture models (GMMs). While existing methods for integrative multi-omics clustering account for variation within and across omic layers, they often fall short in addressing the HDLSS issue, where complex mixture patterns are difficult to estimate from a small number of subjects and model instability grows with model complexity.
Statistical transfer learning has emerged as a powerful approach to address HDLSS by leveraging knowledge from related but distinct source domains to improve modeling in the target domain. Among various strategies, incorporating informative priors from the source domain within a Bayesian framework offers a natural and effective solution. In addition, modern Bayesian methods offer scalable and efficient computation for high-dimensional data. For example, natural-gradient variational inference turns Bayesian inference into an optimization problem and leverages the underlying geometry of the parameter space to achieve fast yet good posterior approximation.
We introduce Praxis-BGM, a natural-gradient variational inference method for GMMs that flexibly incorporates cluster-specific priors—including means, covariance matrices, and structural pathway information—to constrain posterior estimation and facilitate knowledge transfer. These various prior components can be used individually or in combination depending on their availability. They can be obtained from large source datasets or reference atlases. For estimation, we optimize the variational covariance matrices in the Cholesky decomposition space, which ensures positive definiteness, enhances numerical stability, and reduces the number of free parameters for efficiency. We derive natural-gradient updates that incorporate prior knowledge and propose a clustering-driven feature selection procedure based on Bayes Factors. Praxis-BGM is implemented using the JAX library for accelerator-oriented computation with high efficiency and scalability.
We demonstrate the effectiveness of Praxis-BGM through extensive simulations, evaluating the contribution of each component of the informative prior, and assessing the method's ability to overcome less accurate priors while balancing inconsistencies between the priors and observed data. We also demonstrate the application of Praxis-BGM to three applied analyses: 1) two similar COVID-19 metabolomics datasets; 2) two breast cancer transcriptomic datasets from The Cancer Genome Atlas (TCGA) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC); and 3) a single-cell RNA sequencing data set using atlas reference data as the prior. The first two applications highlight how priors for meaningful cluster structures derived from one source study can enhance clustering performance in another, improving both biological interpretability and clinical relevance. In the third example, mean priors derived from labeled cell types in a reference atlas are used to predefine and annotate clusters in the observed data, guiding the estimation process. 
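
The role of informative cluster-specific priors can be conveyed with a much simpler stand-in: EM for a spherical Gaussian mixture whose M-step shrinks component means toward prior means taken from a hypothetical source study. The prior strength, dimensions, and data are assumptions, and this MAP-EM sketch replaces, rather than implements, the natural-gradient variational inference and feature selection of Praxis-BGM.

```python
import numpy as np

rng = np.random.default_rng(12)

# Spherical Gaussian mixture with K components in p dimensions; the prior means
# stand in for cluster centres learned in a source study or reference atlas.
n, p, K = 150, 10, 3
prior_means = np.stack([np.full(p, -2.0), np.zeros(p), np.full(p, 2.0)])
z_true = rng.integers(0, K, size=n)
X = prior_means[z_true] + rng.normal(size=(n, p))

kappa = 5.0                                    # prior strength (pseudo-counts)
mu = prior_means + 0.5 * rng.normal(size=(K, p))
pi = np.full(K, 1.0 / K)
sigma2 = 1.0                                   # variance fixed for simplicity

for it in range(100):
    # E-step: responsibilities under spherical Gaussian components.
    d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(-1)
    logr = np.log(pi) - 0.5 * d2 / sigma2
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr); r /= r.sum(axis=1, keepdims=True)

    # M-step with MAP shrinkage of component means toward the prior means.
    nk = r.sum(axis=0)
    pi = nk / n
    mu = (r.T @ X + kappa * prior_means) / (nk[:, None] + kappa)

print("agreement with the true labels:", (r.argmax(axis=1) == z_true).mean())
```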

Keywords

Multi-omics

Bayesian Clustering

Mixture Model

Variational Inference

High-Dimensional Data

Dimension Reduction 

Co-Author(s)

Jesse Goodrich, University of Southern California
David Conti, University of Southern California

First Author

Qiran Jia, University of Southern California

Presenting Author

Qiran Jia, University of Southern California

Quantile Slice Sampling

We propose and demonstrate a novel, effective approach to simple slice sampling. Using the probability integral transform, we first generalize Neal's shrinkage algorithm, standardizing the procedure to an automatic and universal starting point: the unit interval. This enables the introduction of approximate (pseudo-) targets through a factorization used in importance sampling, a technique that has popularized elliptical slice sampling. Reasonably accurate pseudo-targets can boost sampler efficiency by requiring fewer rejections and by reducing target skewness. This strategy is effective when a natural, possibly crude approximation to the target exists. Alternatively, obtaining a marginal pseudo-target from initial samples provides an intuitive and automatic tuning procedure. We consider pseudo-target specification and interpretable diagnostics. We examine performance of the proposed sampler relative to other popular, easily implemented MCMC samplers on standard targets in isolation, and as steps within a Gibbs sampler in a Bayesian modeling context. We extend to multivariate slice samplers and demonstrate with a constrained state-space model. R package qslice is available on CRAN. 
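
A minimal sketch of the unit-interval construction, under assumed choices of target and pseudo-target: transform through the pseudo-target's CDF, run Neal's shrinkage procedure on (0,1) against the ratio of target to pseudo-target densities, and map draws back through the quantile function. The Gamma target and exponential pseudo-target are illustrative; the qslice package on CRAN provides the authors' implementation and diagnostics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)

# Assumed target and pseudo-target: Gamma(shape=3) and Exponential(rate=0.4).
target = stats.gamma(a=3.0)
pseudo = stats.expon(scale=1.0 / 0.4)

def h(u):
    """Transformed target on (0,1): f(Q(u)) / f_pseudo(Q(u))."""
    x = pseudo.ppf(u)
    return target.pdf(x) / pseudo.pdf(x)

def quantile_slice_step(u):
    y = rng.uniform(0.0, h(u))                 # vertical slice level
    L, R = 0.0, 1.0                            # automatic unit-interval bounds
    while True:
        u_new = rng.uniform(L, R)
        if h(u_new) > y:
            return u_new
        if u_new < u:                          # Neal's shrinkage rule
            L = u_new
        else:
            R = u_new

u, draws = 0.5, np.empty(20000)
for i in range(draws.size):
    u = quantile_slice_step(u)
    draws[i] = pseudo.ppf(u)                   # map back to the original scale

print(draws.mean(), target.mean())             # both should be near 3
```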

Keywords

Markov chain Monte Carlo

Hybrid slice sampling

Bayesian computation 

Co-Author(s)

Samuel Johnson
Joshua Christensen, Pentara Corporation
David Dahl, Brigham Young University

First Author

Matthew Heiner, Brigham Young University

Presenting Author

Matthew Heiner, Brigham Young University

Sampling from ranking models via a hit and run approach

The Mallows permutation model is a prominent model for ranking data, which are prevalent in diverse fields such as recommender systems, psychology, and electoral studies. This model specifies a family of non-uniform probability distributions on permutations, defined via a distance metric on permutations. We focus on two common choices: the L1 distance (Spearman's footrule) and the L2 distance (Spearman's rank correlation). Despite their popularity, these models present a significant computational challenge due to the intractability of their normalizing constants, hindering off-the-shelf sampling and inference methods. We develop and analyze hit and run Markov chain Monte Carlo algorithms for sampling from these Mallows models. For both models, we establish order log(n) mixing time upper bounds, providing the first theoretical guarantees for efficient sampling. The convergence analysis employs novel couplings for permutations with one-sided restrictions and leverages the path coupling technique. These advancements enable efficient Monte Carlo maximum likelihood estimation, facilitating scalable inference for ranking data with statistical guarantees. 
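
For readers who want to experiment, the Mallows model with footrule distance is easy to sample approximately with a plain random-transposition Metropolis chain, a simpler stand-in for the hit-and-run algorithms analyzed in the talk, without their mixing-time guarantees. The dispersion parameter and reference ranking below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(14)

n, beta = 10, 0.3                              # arbitrary size and dispersion
sigma0 = np.arange(n)                          # central (reference) ranking

def footrule(sigma):
    """Spearman's footrule (L1) distance to the reference ranking."""
    return np.abs(sigma - sigma0).sum()

sigma = rng.permutation(n)
d = footrule(sigma)
dists = []

for it in range(50000):
    i, j = rng.choice(n, size=2, replace=False)
    prop = sigma.copy()
    prop[i], prop[j] = prop[j], prop[i]        # transpose two positions
    d_prop = footrule(prop)
    if rng.uniform() < np.exp(-beta * (d_prop - d)):
        sigma, d = prop, d_prop                # Metropolis acceptance
    if it >= 10000:
        dists.append(d)

print("mean footrule distance under the Mallows model:", np.mean(dists))
```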

Keywords

Mallows permutation model

ranking data

hit and run algorithm

Markov chain Monte Carlo

mixing time

scalable inference 

First Author

Chenyang Zhong, Department of Statistics, Columbia University

Presenting Author

Chenyang Zhong, Department of Statistics, Columbia University