Monday, Aug 4: 10:30 AM - 12:20 PM
4060
Contributed Speed
Music City Center
Room: CC-104B
Presentations
Existing approaches to model uncertainty typically either compare models using a quantitative model selection criterion or evaluate posterior model probabilities having set a prior. In this paper, we propose an alternative strategy which views missing observations as the source of model uncertainty, where the true model would be identified with the complete data. To quantify model uncertainty, it is then necessary to provide a probability distribution for the missing observations conditional on what has been observed. This can be set sequentially using one-step-ahead predictive densities, which recursively sample from the best model according to some consistent model selection criterion. Repeated predictive sampling of the missing data, to give a complete dataset and hence a best model each time, provides our measure of model uncertainty. This approach bypasses the need for subjective prior specification or integration over parameter spaces, addressing issues with standard methods such as the Bayes factor. We provide illustrations from hypothesis testing, density estimation, and variable selection, demonstrating our approach on a range of standard problems.
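A minimal illustrative sketch of the predictive-resampling idea (not the paper's implementation): two Gaussian mean models are compared by BIC (one consistent criterion among many), missing future observations are drawn one step ahead from the currently selected model using a plug-in predictive, and repeated completion of the data yields model-selection frequencies as a measure of model uncertainty. The sample sizes and the plug-in predictive are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def bic_select(y):
    """Return 0 for M0: N(0, 1) and 1 for M1: N(mu, 1), whichever has smaller BIC."""
    n = len(y)
    ll0 = -0.5 * np.sum(y ** 2)                  # log-likelihood under mu = 0 (constants dropped)
    ll1 = -0.5 * np.sum((y - y.mean()) ** 2)     # profile log-likelihood under free mu
    bic0, bic1 = -2 * ll0, -2 * ll1 + np.log(n)  # one extra parameter in M1
    return int(bic1 < bic0)

def predictive_resample(y_obs, N=500, B=200):
    """Complete the data to size N by one-step-ahead predictive draws from the
    currently selected model; record which model wins on each completed dataset."""
    wins = np.zeros(2, dtype=int)
    for _ in range(B):
        y = list(y_obs)
        while len(y) < N:
            arr = np.array(y)
            m = bic_select(arr)
            mu = 0.0 if m == 0 else arr.mean()   # plug-in predictive mean (an assumption)
            y.append(rng.normal(mu, 1.0))
        wins[bic_select(np.array(y))] += 1
    return wins / B                              # selection frequencies = model uncertainty

y_obs = rng.normal(0.3, 1.0, size=30)
print(predictive_resample(y_obs))
```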
Keywords
predictive inference
model uncertainty
hypothesis testing
Variable selection is crucial in statistical modeling, especially in high-dimensional contexts where it improves interpretability and accuracy. We propose a neural network-based approach for variable selection in nonparametric regression models, incorporating L1 penalties and custom loss functions to encourage sparsity while maintaining deep learning flexibility. Our hybrid framework uses neural networks for feature selection, efficiently managing many variables. Comparisons with Bayesian Kernel Machine Regression (BKMR) show our method handles more variables, overcoming BKMR's computational limits. Simulation studies demonstrate our method effectively selects important variables and mitigates overfitting. This approach offers a scalable solution for high-dimensional modeling, with advantages over traditional methods in complex data structures, such as in environmental science research.
Keywords
Variable Selection
Neural Network
Nonparametric
Environmental Science
Understanding the structure of our universe and distribution of matter is an area of active research. As cosmological surveys grow, developing emulators to efficiently predict matter power spectra is essential. We are motivated by the Mira-Titan Universe simulation suite which, for a specified cosmological parameterization (i.e., a cosmology), provides multiple response curves of various fidelities, including correlated functional realizations. First, we estimate the underlying true matter power spectra, with appropriate uncertainty quantification (UQ), from all provided curves. We propose a Bayesian deep Gaussian process (DGP) hierarchical model which synthesizes all information and estimates the underlying matter power spectra, while providing effective UQ. Our model extends previous work on Bayesian DGPs from scalar responses to functional ones. Second, we leverage predicted power spectra from various cosmologies to accurately predict the matter power spectra for an unobserved cosmology. We use basis functions of the functional spectra to train a separate Gaussian process emulator. Our method performs well in synthetic exercises and against the benchmark emulator.
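A simplified sketch of the basis-function emulation step, using principal components and independent Gaussian processes in scikit-learn rather than the Bayesian DGP hierarchical model of the abstract; the array shapes, kernel, and number of basis functions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_emulator(theta, spectra, n_basis=5):
    """theta: (n_cosmo, d) cosmological parameters; spectra: (n_cosmo, n_k) log power
    spectra on a common k grid.  One independent GP is trained per basis-function weight."""
    pca = PCA(n_components=n_basis)
    scores = pca.fit_transform(spectra)               # basis weights per cosmology
    gps = []
    for j in range(n_basis):
        kernel = ConstantKernel() * RBF(length_scale=np.ones(theta.shape[1]))
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(theta, scores[:, j])
        gps.append(gp)
    return pca, gps

def predict_spectra(pca, gps, theta_new):
    """Predict the full spectrum for unobserved cosmologies from the basis-weight GPs."""
    scores_new = np.column_stack([gp.predict(theta_new) for gp in gps])
    return pca.inverse_transform(scores_new)
```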
Keywords
Deep Gaussian Processes
Hierarchical Modeling
Cosmology
Markov Chain Monte Carlo
Surrogate
Uncertainty Quantification
We propose two Bayesian mixed effects models, one linear and one linear spline, to estimate the average effect of a binary treatment on a target population via one-stage meta-analysis. In an extension of previous work in a frequentist setting, we aim to combine information from a collection of randomized trials to identify the average treatment effect (ATE) on a separate, non-study target population, by allowing study-level random effects to account for variations in outcome due to differences in studies. We examine, with simulation studies, several situations in which weight-based estimators and/or nonparametric machine learning methods face challenges in estimating a population ATE, and highlight the advantages of our parametric, outcome-based estimators.
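A minimal sketch of a Bayesian linear mixed effects model in this spirit, written in PyMC with study-level random intercepts and a single treatment-effect parameter; the priors, Gaussian outcome, and variable names are placeholder choices, not the authors' exact linear or linear-spline specifications.

```python
import pymc as pm

def fit_meta(y, trt, study, X):
    """y: outcome; trt: 0/1 treatment; study: integer study index 0..S-1; X: (n, p) covariates."""
    S = int(study.max()) + 1
    with pm.Model() as model:
        alpha = pm.Normal("alpha", 0, 10)
        beta = pm.Normal("beta", 0, 10, shape=X.shape[1])
        delta = pm.Normal("delta", 0, 10)          # average treatment effect
        tau = pm.HalfNormal("tau", 1.0)
        u = pm.Normal("u", 0, tau, shape=S)        # study-level random intercepts
        sigma = pm.HalfNormal("sigma", 1.0)
        mu = alpha + u[study] + pm.math.dot(X, beta) + delta * trt
        pm.Normal("y_obs", mu, sigma, observed=y)
        idata = pm.sample(1000, tune=1000, target_accept=0.9)
    return idata
```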
Keywords
meta analysis
generalizability
mixed effects models
We propose the use of overfit mixture models to select the number of components for categorical data with multiple items in the response variable. Latent class analysis (LCA) requires the user to specify the number of classes in the population. Since this is often unknown to the researcher, many studies have investigated the performance of selection criteria, including hypothesis tests, likelihood-based information criteria, and cluster-based information criteria. Alternatively, for Gaussian mixtures, sparse finite mixtures allow a user to fit a model with a large number of components and learn the number of active components a posteriori. We adapt sparse finite mixtures to select the number of components for the LCA model. We provide careful recommendations for priors on the item response probabilities and component means to produce model selection for varying dimensions of the response variable.
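A rough sketch of a sparse finite mixture for LCA with binary items: the model is deliberately overfitted with K classes, a small Dirichlet concentration e0 empties redundant classes, and the number of occupied classes is tracked a posteriori. The choice of K, the Beta priors, and the binary-item simplification are assumptions for illustration, not the recommendations developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_lca_gibbs(X, K=15, e0=0.01, a=1.0, b=1.0, iters=3000, burn=1000):
    """X: (n, J) binary item responses; K: deliberately overfitted number of classes."""
    n, J = X.shape
    theta = rng.uniform(0.2, 0.8, size=(K, J))    # item response probabilities
    w = np.full(K, 1.0 / K)
    n_active = []
    for it in range(iters):
        # class memberships given weights and item response probabilities
        logp = np.log(w + 1e-300)[None, :] + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        logp -= logp.max(axis=1, keepdims=True)
        prob = np.exp(logp)
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=prob[i]) for i in range(n)])
        counts = np.bincount(z, minlength=K)
        # sparse Dirichlet prior on the weights: small e0 empties redundant classes
        w = rng.dirichlet(e0 + counts)
        # conjugate Beta updates for the item response probabilities
        for k in range(K):
            s = X[z == k].sum(axis=0)
            theta[k] = rng.beta(a + s, b + counts[k] - s)
        if it >= burn:
            n_active.append((counts > 0).sum())
    return np.bincount(n_active)  # posterior distribution of the number of filled classes
```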
Keywords
Latent class analysis
Bayesian mixture model
Cluster analysis
It is well known in fMRI studies that the first three images in a time series have a much higher signal than the remainder of the time series. Many attempts to decrease noise and improve activation are applied to image data, where voxels may be spatially correlated. This requires the use of spatial modeling and often results in blurrier images. A Bayesian approach is employed on the uncorrelated k-space magnitude and phase data since the spatial frequency coefficients can be treated independently of each other. This method quantifies available a priori information about spatial frequency coefficients from the first three images in the time series. The spatial frequencies observed throughout the fMRI experiment are then incorporated and spatial frequency coefficients are estimated a posteriori using both the ICM algorithm and Gibbs sampling. Images reconstructed by discrete inverse Fourier transform from the posterior-estimated spatial frequencies will have reduced noise and increased detection power.
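A heavily simplified sketch of the prior-combination idea: a conjugate-normal update of each spatial-frequency coefficient using the first three (high-signal) images as prior information, followed by inverse Fourier reconstruction. The simple posterior-mean combination and the precision ratio stand in for the ICM and Gibbs sampling estimation described above.

```python
import numpy as np

def posterior_kspace(first3, rest, lam=1.0):
    """first3: (3, ny, nx) complex k-space arrays from the initial high-signal images (prior).
    rest: (T, ny, nx) complex k-space arrays from the remainder of the series.
    Each spatial-frequency coefficient is updated independently; real and imaginary parts
    are treated as independent with a common noise scale (an assumption); lam is the
    assumed ratio of prior to data precision."""
    prior_mean = first3.mean(axis=0)
    data_mean = rest.mean(axis=0)
    T = rest.shape[0]
    post = (lam * prior_mean + T * data_mean) / (lam + T)   # precision-weighted average
    img = np.fft.ifft2(np.fft.ifftshift(post))              # inverse discrete Fourier transform
    return np.abs(img)
```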
Keywords
Bayesian
fMRI
k-space
In statistical precision medicine, optimal dynamic treatment regimes (DTR) are sequences of decision rules assigning treatment tailored to patients' observed history, maximizing the long-term patient outcome, referred to as the reward. While Bayesian estimation of the reward under different treatment regimes often requires explicit specification of all model components, non-tailoring confounders can instead be accounted for using inverse weighting through maximizing the utility, defined as the log-likelihood marginalized over these confounders. DTR estimation methods focus on correcting for confounding in observed longitudinal data. However, such data can also be irregularly observed, with visit times driven by patient history; failure to account for this irregularity can induce bias in reward and optimal DTR estimates. In this work, we extend existing weighting approaches for DTR estimation within the Bayesian paradigm to irregularly observed data. We showcase through simulation studies that we can estimate rewards and optimal regimes without bias by using a double weighting approach, with inverse weighting terms to control both confounding and visit irregularity.
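A minimal sketch of the double weighting construction using simple logistic models for the treatment and visit processes (scikit-learn); it illustrates how inverse-probability-of-treatment and inverse-intensity-of-visit weights are combined, not the Bayesian semiparametric estimator itself, and stabilization or time-varying refinements are omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def double_weights(H, A, V):
    """H: (n, p) patient history at each time point (confounders);
    A: (n,) binary treatment received; V: (n,) indicator that a visit occurred.
    Returns weights combining inverse probability of treatment with
    inverse intensity of visit, both fitted with plain logistic regressions."""
    # treatment model, fitted among observed visits
    trt = LogisticRegression(max_iter=1000).fit(H[V == 1], A[V == 1])
    p_a = trt.predict_proba(H)[:, 1]
    ipt = np.where(A == 1, 1 / p_a, 1 / (1 - p_a))
    # visit-intensity model
    vis = LogisticRegression(max_iter=1000).fit(H, V)
    p_v = vis.predict_proba(H)[:, 1]
    iiv = 1 / p_v
    return ipt * iiv
```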
Keywords
Dynamic treatment regimes
irregularly observed data
inverse weighting
Bayesian semiparametric estimation
statistical precision medicine
This study addresses the challenge of estimating proportions for small areas, particularly when sample sizes are small or even zero, leading to unreliable direct estimates. We propose a Bayesian approach using a new probability model class that incorporates random effects to account for variations between small areas. These models are based on exchangeable Archimedean copulas, allowing for flexible extra binomial variation modeling. The Bayesian inference involves obtaining the posterior distribution of the random effect and its Laplace transform, which is then used to derive Bayes estimates of small area proportions. Model parameters are estimated using maximum likelihood, and the Akaike information criterion (AIC) is used for model selection. We also develop empirical best predictors (EBP) and empirical best linear unbiased predictors (EBLUP) for small area proportions and propose a jackknife method to estimate the prediction mean squared error (PMSE) of these predictors. The methods are illustrated through simulated and real data examples, demonstrating their effectiveness in providing reliable small-area estimates. This work contributes to the growing literature on small-area estimation.
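A toy sketch of an empirical best predictor under a simple beta-binomial random-effects model (not the Archimedean-copula models of the abstract): the model parameters are estimated by maximum likelihood, and the EBP shrinks direct estimates toward the overall level, remaining well defined even for areas with zero sample size.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_betabin(y, n):
    """Maximum likelihood for a beta-binomial model; y: area counts, n: area sample sizes."""
    def nll(par):
        a, b = np.exp(par)   # keep a, b positive
        return -np.sum(betaln(a + y, b + n - y) - betaln(a, b))
    res = minimize(nll, x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x)

def ebp(y, n, a, b):
    """Empirical best predictor of the area proportion under the beta-binomial model."""
    return (a + y) / (a + b + n)

y = np.array([2, 0, 5, 1])
n = np.array([10, 0, 25, 8])
a_hat, b_hat = fit_betabin(y, n)
print(ebp(y, n, a_hat, b_hat))   # empty areas fall back to the synthetic estimate a/(a+b)
```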
Keywords
Small Area Estimation
Bayesian Inference
Archimedean Copulas
Random Effects
Empirical Best Predictor (EBP)
Maximum Likelihood Estimation
Jackknife Method
The Dirichlet distribution, central to probability theory and statistical modeling, is a cornerstone for multinomial sampling models. Its ability to simulate proportions across categories makes it indispensable in fields like biology and communication theory. However, as our understanding of complex systems advances, there is a need to extend and generalise models to better capture real-world intricacies. Hence, researchers have introduced some generalisations of the Standard Dirichlet such as the Dirichlet Type 3, Scaled Dirichlet, Shifted-scaled Dirichlet, Flexible Dirichlet and Extended Flexible Dirichlet distributions. These models provide greater flexibility in capturing dependencies and expert beliefs in elicitation contexts. While eliciting the standard Dirichlet distribution is well documented, methods for eliciting these more flexible models are relatively new. To support this, we have developed methods and a Shiny R application that enables experts to input and visualise their beliefs using these generalisations. This tool offers a practical approach to applying these advanced models, supporting the elicitation process and broadening their use in applied research.
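A small sketch of a standard elicitation step for the ordinary Dirichlet case only: expert means for the proportions plus a variance for one of them determine the concentration through the Beta marginal of the Dirichlet. The generalised families supported by the Shiny application require richer expert inputs; this is just the baseline calculation.

```python
import numpy as np

def elicit_dirichlet(means, var_first):
    """means: expert best guesses for the category proportions (summing to 1);
    var_first: expert variance for the first proportion.
    Uses the Beta marginal of the Dirichlet: Var(p_1) = m_1 (1 - m_1) / (alpha_0 + 1)."""
    means = np.asarray(means, dtype=float)
    alpha0 = means[0] * (1 - means[0]) / var_first - 1
    return alpha0 * means

alpha = elicit_dirichlet([0.5, 0.3, 0.2], var_first=0.01)   # alpha_0 = 24, alpha = [12, 7.2, 4.8]
```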
Keywords
Bayesian Statistics
Prior Elicitation
Expert Elicitation
Dirichlet Distribution
Shiny R
Text classification has been central to data science, with models like Naive Bayes serving as foundational tools. While effective for short, structured data, traditional classifiers struggled with long-form text due to methods like TF-IDF, which failed to capture context and long-term dependencies. Tasks like sentiment analysis of reports or categorizing legal documents often led to inefficiencies. Large language models (LLMs) have transformed this landscape. Transformer architectures excel at understanding context within text, enabling accurate classification for complex tasks such as identifying topics in policy documents or classifying unstructured medical records. Further, smaller, fine-tuned LLMs like BERT provide scalable, cost-effective solutions without sacrificing performance compared to larger general-purpose LLMs like GPT-4o, which involve higher costs for both training and inference. This talk focuses on how such smaller LLMs can be utilized to solve classification problems at a low cost with high accuracy. It will highlight strategies for deploying such models efficiently so that they empower data scientists and statisticians to tackle challenges in modern text analytics.
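A hedged sketch of fine-tuning a small encoder such as BERT for classification with the Hugging Face transformers library; the dataset, subset sizes, and hyperparameters here are placeholders, not recommendations from the talk.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"                   # any small encoder would work
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# stand-in dataset with "text"/"label" columns; swap in your own documents
ds = load_dataset("imdb")
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=256),
            batched=True)

args = TrainingArguments(output_dir="bert-clf", num_train_epochs=2,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"].shuffle(seed=0).select(range(2000)),
                  eval_dataset=ds["test"].select(range(1000)))
trainer.train()
print(trainer.evaluate())
```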
Keywords
Text classification
Large language models (LLMs)
Transformer architectures
BERT
Cost-effective solutions
Modern text analytics
The random dot product graph (RDPG) is a popular model for network data with extensions that accommodate dynamic (time-varying) networks. However, two significant deficiencies exist in the dynamic RDPG literature: (1) no coherent Bayesian way to update one's prior beliefs about the model parameters due to their complicated constraints, and (2) no approach to forecast future networks with meaningful uncertainty quantification. This work proposes a generalized Bayesian framework that addresses these needs using a Gibbs posterior that represents a coherent updating of Bayesian beliefs based on a least-squares loss function. Furthermore, we establish the consistency and contraction rate of this Gibbs posterior under commonly adopted Gaussian random walk priors. For estimation, we develop a fast Gibbs sampler with a time complexity that is linear in both the number of time points and observed edges in the dynamic network. Simulations and real data analyses show that the proposed method's in-sample and forecasting performance outperforms that of competitors.
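A minimal sketch of the generalized-Bayes construction described above: the unnormalized Gibbs log-posterior combines a least-squares loss over the dynamic adjacency matrices with a Gaussian random walk prior on the latent positions. The learning rate and prior scales are placeholder values, self-loop terms are not treated specially, and the fast Gibbs sampler itself is not shown.

```python
import numpy as np

def gibbs_log_posterior(X, A, eta=1.0, sigma_rw=0.1, sigma0=1.0):
    """X: (T, n, d) latent positions over time; A: (T, n, n) symmetric adjacency matrices.
    Gibbs posterior ~ exp(-eta * least-squares loss) x Gaussian random walk prior."""
    T = X.shape[0]
    loss = sum(np.sum((A[t] - X[t] @ X[t].T) ** 2) for t in range(T))
    log_prior = -0.5 * np.sum(X[0] ** 2) / sigma0 ** 2
    log_prior += -0.5 * sum(np.sum((X[t] - X[t - 1]) ** 2) for t in range(1, T)) / sigma_rw ** 2
    return -eta * loss + log_prior
```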
Keywords
Gibbs posterior
Dynamic network data
Network latent space models
statistical network analysis
Bayesian inference
Forecasting
Item Response Theory (IRT) models are widely used in psychometrics to measure latent traits like ability from test responses. Standard IRT models assume a fixed trait distribution, which may not capture population differences. To address this, Bayesian nonparametric (BNP) IRT models use priors such as the Chinese Restaurant Process (CRP) to allow data-driven clustering of individuals. While this increases flexibility, it also adds computational complexity, making accurate marginal likelihood estimation crucial for comparing BNP and parametric models using Bayes factors, especially in high-dimensional settings. Bridge sampling provides a more stable alternative to traditional Monte Carlo methods but must be adapted to handle the discrete clustering structure of BNP models.
This work develops a two-step method for marginal likelihood estimation in BNP IRT models. First, latent traits are integrated out using the model's structure, reducing computation. Second, bridge sampling is refined, incorporating moment-matching and variance reduction techniques to improve accuracy. Simulation results show that this method enhances efficiency and precision.
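A toy sketch of the basic bridge sampling estimator with a moment-matched Gaussian proposal, using the Meng and Wong fixed-point iteration on an unnormalized bivariate Gaussian whose normalizing constant (2 pi) is known. It illustrates only the estimator that the two-step method refines, not the refinements themselves; in practice the posterior draws would also be split between fitting the proposal and evaluating the estimator.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)

def bridge_estimate(post_samples, log_q1, n_prop=None, iters=50):
    """post_samples: draws from the (marginalized) posterior; log_q1: log of the
    UNNORMALIZED posterior density.  The proposal is a moment-matched Gaussian;
    returns the estimated marginal likelihood."""
    N1 = len(post_samples)
    N2 = n_prop or N1
    mu, Sigma = post_samples.mean(axis=0), np.cov(post_samples.T)
    prop = mvn(mu, Sigma)
    prop_samples = prop.rvs(size=N2, random_state=rng)
    l1 = np.exp(log_q1(post_samples) - prop.logpdf(post_samples))   # at posterior draws
    l2 = np.exp(log_q1(prop_samples) - prop.logpdf(prop_samples))   # at proposal draws
    s1, s2 = N1 / (N1 + N2), N2 / (N1 + N2)
    r = 1.0
    for _ in range(iters):                                          # fixed-point iteration
        r = (np.mean(l2 / (s1 * l2 + s2 * r)) /
             np.mean(1.0 / (s1 * l1 + s2 * r)))
    return r

# check: an unnormalized 2-D standard normal has normalizing constant 2*pi ~ 6.28
draws = rng.standard_normal((5000, 2))
print(bridge_estimate(draws, lambda x: -0.5 * np.sum(np.atleast_2d(x) ** 2, axis=1)))
```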
Keywords
Bridge Sampling
Bayes Factor
Hierarchical Models
Latent Variables
Monte Carlo Methods
Nonparametric Clustering
In metabolomics, which involves the study of small molecules in biological samples, data are often acquired via mass spectrometry, resulting in high-dimensional, highly correlated datasets with frequent missing values. Both missing at random (MAR), due to acquisition or processing errors, and missing not at random (MNAR), often caused by values falling below detection thresholds, are common. Imputation is thus a critical component of downstream analysis. We propose a novel Truncated Gaussian Infinite Factor Analysis (TGIFA) model to address these challenges. By incorporating truncated Gaussian assumptions, TGIFA respects the physical constraints of the data, while the use of an infinite latent factor framework eliminates the need to pre-specify the number of factors. Our Bayesian inference approach jointly models MAR and MNAR mechanisms and, via a computationally efficient exchange algorithm, provides posterior uncertainty quantification for both imputed values and missingness types. We evaluate TGIFA through extensive simulation studies and apply it to a urinary metabolomics dataset, where it yields sensible and interpretable imputations with associated uncertainty estimates.
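A heavily reduced sketch of the MNAR mechanism only: a value missing because it fell below the detection limit is imputed from a Gaussian truncated above at that limit. In TGIFA the mean and scale would come from the fitted infinite factor model and the missingness type would itself be inferred; here they are plug-in numbers for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

def impute_below_lod(mu, sigma, lod, rng=None):
    """Draw an imputed value for a feature missing because it fell below the limit of
    detection (lod), assuming the underlying abundance is Normal(mu, sigma) truncated
    to (-inf, lod); mu and sigma are placeholder plug-in values."""
    rng = rng or np.random.default_rng()
    a, b = -np.inf, (lod - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)

print(impute_below_lod(mu=1.0, sigma=0.5, lod=0.8))
```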
Keywords
Missing data
Metabolomics
Imputation
Infinite factor model
Mass spectrometry data
We propose a two-stage logistic regression framework tailored for analyzing longitudinal binary data with time-dependent covariates. The model incorporates Bayesian priors and random effects to address feedback loops, correlations from repeated measurements, and the complexities of evolving covariates in hierarchical contexts. By partitioning covariates into time-dependent and time-independent components, the framework effectively handles unequally spaced observations and missing-at-random data. The Generalized Method of Moments is used to identify valid instruments, distinguishing between valid and invalid moment conditions. Parameter estimation is conducted via Markov Chain Monte Carlo (MCMC) techniques, ensuring consistent and asymptotically normal estimates. The approach is validated by simulation studies and applied to medical data, highlighting its utility in capturing dynamic predictor-outcome relationships. This model is relevant for fields like medical research, public health, and behavioral sciences, where dynamic processes play a critical role. The proposed framework is capable of managing highly correlated data and reducing biases typically seen in traditional methods.
Keywords
Longitudinal Binary Data
Two-stage Logistic Regression
Time-Dependent Covariates
Bayesian Priors
Random Effects Models
Hierarchical Models
Recent advances in high-throughput omics technologies have enabled observational studies to collect multiple omics layers on the same individuals. However, high-dimensional, low-sample-size (HDLSS) data pose significant challenges for model-based clustering approaches such as Gaussian Mixture Models (GMMs). While existing methods for integrative multi-omics clustering account for variations within and across omic layers, they often fall short in addressing the HDLSS issue, where complex mixture patterns are difficult to generalize from a small number of subjects and model instability increases with model complexity.
Statistical transfer learning has emerged as a powerful approach to address HDLSS by leveraging knowledge from related but distinct source domains to improve modeling in the target domain. Among various strategies, incorporating informative priors from the source domain within a Bayesian framework offers a natural and effective solution. In addition, modern Bayesian methods offer scalable and efficient computation for high-dimensional data. For example, natural-gradient variational inference turns Bayesian inference into an optimization problem and leverages the underlying geometry of the parameter space to achieve fast yet good posterior approximation.
We introduce Praxis-BGM, a natural-gradient variational inference method for GMMs that flexibly incorporates cluster-specific priors—including means, covariance matrices, and structural pathway information—to constrain posterior estimation and facilitate knowledge transfer. These various prior components can be used individually or in combination depending on their availability. They can be obtained from large source datasets or reference atlases. For estimation, we optimize the variational covariance matrices in the Cholesky decomposition space, which ensures positive definiteness, enhances numerical stability, and reduces the number of free parameters for efficiency. We derive natural-gradient updates that incorporate prior knowledge and propose a clustering-driven feature selection procedure based on Bayes Factors. Praxis-BGM is implemented using the JAX library for accelerator-oriented computation with high efficiency and scalability.
We demonstrate the effectiveness of Praxis-BGM through extensive simulations, evaluating the contribution of each component of the informative prior, and assessing the method's ability to overcome less accurate priors while balancing inconsistencies between the priors and observed data. We also demonstrate the application of Praxis-BGM to three applied analyses: 1) two similar COVID-19 metabolomics datasets; 2) two breast cancer transcriptomic datasets from The Cancer Genome Atlas (TCGA) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC); and 3) a single-cell RNA sequencing data set using atlas reference data as the prior. The first two applications highlight how priors for meaningful cluster structures derived from one source study can enhance clustering performance in another, improving both biological interpretability and clinical relevance. In the third example, mean priors derived from labeled cell types in a reference atlas are used to predefine and annotate clusters in the observed data, guiding the estimation process.
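A small sketch of the Cholesky-space parameterization described above, written in JAX: an unconstrained vector is mapped to a lower-triangular factor with a positive (softplus) diagonal, so the implied variational covariance is positive definite by construction. The function and variable names are illustrative, not part of the Praxis-BGM code.

```python
import jax
import jax.numpy as jnp

def chol_from_params(params, d):
    """Map an unconstrained vector of length d(d+1)/2 to a valid Cholesky factor L,
    so that Sigma = L @ L.T is positive definite by construction."""
    L = jnp.zeros((d, d))
    L = L.at[jnp.tril_indices(d)].set(params)
    diag = jax.nn.softplus(jnp.diag(L))          # keep the diagonal strictly positive
    return L.at[jnp.diag_indices(d)].set(diag)

d = 3
params = jnp.arange(d * (d + 1) // 2, dtype=jnp.float32) * 0.1
L = chol_from_params(params, d)
Sigma = L @ L.T                                  # symmetric positive definite variational covariance
```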
Keywords
Multi-omics
Bayesian Clustering
Mixture Model
Variational Inference
High-Dimensional Data
Dimension Reduction
Co-Author(s)
Jesse Goodrich, University of Southern California
David Conti, University of Southern California
First Author
Qiran Jia, University of Southern California
Presenting Author
Qiran Jia, University of Southern California
We propose and demonstrate a novel, effective approach to simple slice sampling. Using the probability integral transform, we first generalize Neal's shrinkage algorithm, standardizing the procedure to an automatic and universal starting point: the unit interval. This enables the introduction of approximate (pseudo-) targets through a factorization used in importance sampling, a technique that has popularized elliptical slice sampling. Reasonably accurate pseudo-targets can boost sampler efficiency by requiring fewer rejections and by reducing target skewness. This strategy is effective when a natural, possibly crude approximation to the target exists. Alternatively, obtaining a marginal pseudo-target from initial samples provides an intuitive and automatic tuning procedure. We consider pseudo-target specification and interpretable diagnostics. We examine performance of the proposed sampler relative to other popular, easily implemented MCMC samplers on standard targets in isolation, and as steps within a Gibbs sampler in a Bayesian modeling context. We extend to multivariate slice samplers and demonstrate with a constrained state-space model. R package qslice is available on CRAN.
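A compact sketch of a quantile-transform slice sampler in this spirit (the abstract's R package qslice provides the full implementation): the pseudo-target CDF maps the problem to the unit interval, and Neal's shrinkage procedure is applied there to the importance-ratio target. The Gamma target and Cauchy pseudo-target are toy choices used only to make the sketch runnable.

```python
import numpy as np
from scipy.stats import cauchy, gamma

rng = np.random.default_rng(0)

def slice_step_unit(u0, h):
    """One shrinkage slice-sampling update on (0, 1) for an unnormalized target h."""
    y = rng.uniform(0.0, h(u0))
    lo, hi = 0.0, 1.0
    while True:
        u1 = rng.uniform(lo, hi)
        if h(u1) > y:
            return u1
        if u1 < u0:
            lo = u1
        else:
            hi = u1

def quantile_slice(log_target, pseudo, n_iter=5000, x0=1.0):
    """Transform x to u = Psi(x) with the pseudo-target CDF, then slice sample the
    importance-ratio target h(u) = pi(Psi^{-1}(u)) / psi(Psi^{-1}(u)) on (0, 1).
    `pseudo` is any scipy frozen distribution serving as a crude approximation."""
    def h(u):
        x = pseudo.ppf(u)
        return np.exp(log_target(x) - pseudo.logpdf(x))
    u = pseudo.cdf(x0)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        u = slice_step_unit(u, h)
        draws[i] = pseudo.ppf(u)
    return draws

# example: a skewed Gamma(3) target with a heavy-tailed Cauchy pseudo-target near its mode
samples = quantile_slice(lambda x: gamma.logpdf(x, a=3), cauchy(loc=2.0, scale=1.5))
```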
Keywords
Markov chain Monte Carlo
Hybrid slice sampling
Bayesian computation
The Mallows permutation model is a prominent model for ranking data, which are prevalent in diverse fields such as recommender systems, psychology, and electoral studies. This model specifies a family of non-uniform probability distributions on permutations, defined via a distance metric on permutations. We focus on two common choices: the L1 distance (Spearman's footrule) and the L2 distance (Spearman's rank correlation). Despite their popularity, these models present a significant computational challenge due to the intractability of their normalizing constants, hindering off-the-shelf sampling and inference methods. We develop and analyze hit and run Markov chain Monte Carlo algorithms for sampling from these Mallows models. For both models, we establish order log(n) mixing time upper bounds, providing the first theoretical guarantees for efficient sampling. The convergence analysis employs novel couplings for permutations with one-sided restrictions and leverages the path coupling technique. These advancements enable efficient Monte Carlo maximum likelihood estimation, facilitating scalable inference for ranking data with statistical guarantees.
Keywords
Mallows permutation model
ranking data
hit and run algorithm
Markov chain Monte Carlo
mixing time
scalable inference
First Author
Chenyang Zhong, Department of Statistics, Columbia University
Presenting Author
Chenyang Zhong, Department of Statistics, Columbia University