
SPEED 1: Data Challenge, Bayesian Analysis, and Statistical Applications, Part 1

Sunday, Aug 2: 2:00 PM - 3:50 PM
Contributed Speed

Presentations

Exploring What Works in Education

TBD

Speaker

Adam Loy, Carleton College

Interpretable Pattern Discovery in the What Works Clearinghouse (WWC) Database

The What Works Clearinghouse (WWC) database aggregates findings from thousands of education studies and includes structured indicators of evidence strength. We will develop an interpretable workflow to identify recurring patterns in reported findings across interventions, outcomes, grade levels, and available contexts, while quantifying heterogeneity. Our analysis will incorporate evidence tiers and standards ratings to assess robustness and will use simple interpretable summaries supported by diagnostic visualizations.

Keywords

The Annual Data Challenge Expo

View Abstract 3639

Speaker

Lesia Semenova, Rutgers University

From Intervention to Impact: Analyzing Educational Effectiveness in U.S.

Evidence-based education policy increasingly depends on systematic reviews to guide the adoption and scaling of instructional programs. Our goal is to provide a more nuanced understanding of "what works, for whom, and under what conditions" in education. In this research, we analyze the What Works Clearinghouse dataset, maintained by the U.S. Department of Education, to examine heterogeneity in educational intervention effectiveness.
The WWC dataset provides a comprehensive synthesis of educational research in a multi-level structure, linking findings, studies, and intervention reports. We investigate "what works", i.e., which version of an intervention works or, in other words, which intervention is effective under what protocol. We explain the "for whom" and "under what conditions" components by leveraging information on region, school type, grades, race, sex, etc. To understand whether such an intervention-protocol combination "works", we investigate variables that measure the impact of the intervention on specific outcome domains. In contrast to most applications of WWC data, which emphasize average intervention effectiveness, our goal in this study is subgroup analysis.

View Abstract 3749

Speaker

Mohammed Rahman

Co-Author(s)

Aditi Sen
Ayoushman Bhattacharya, Washington University in St. Louis

Advances in Spatial Integer-Valued Time Series Modeling

This study develops an empirical Bayes (EB) spatial hurdle INGARCH model for weekly dengue fever counts and compares it with a spatial zero-inflated generalized Poisson (ZIGP) INGARCH framework, both capturing spatio-temporal dependence and excess zeros. The EB-hurdle model introduces a data-adaptive prior that reduces model complexity and enhances parsimony and stability while retaining flexibility for dynamic zero inflation. In spatial INGARCH models, the zero-generating mechanism depends on lagged outcomes, implying that covariate effects influence the intensity equation indirectly. To enhance epidemiological relevance, seasonal patterns are incorporated into the log-intensity equations using Fourier-based harmonic terms and meteorological covariates. Model inference is conducted within a Bayesian framework. The results highlight the distinct roles of seasonal and environmental drivers in dengue transmission and demonstrate that Fourier-based periodic components provide an effective alternative when meteorological data are limited or unavailable. Overall, the empirical Bayes approach offers a parsimonious and stable improvement over conventional hurdle INGARCH models.
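As a rough illustration of the Fourier-based harmonic terms mentioned in the abstract, the sketch below builds sine and cosine covariates for a weekly series and adds them to a log-intensity. It is not the authors' model: the period of 52 weeks, the number of harmonics, the function names, and the coefficient values are all assumptions made for illustration, and the spatial, hurdle, and INGARCH feedback components are left out.

```python
import math

def fourier_terms(t, period=52, n_harmonics=2):
    """Return [sin(2*pi*k*t/period), cos(2*pi*k*t/period)] for k = 1..n_harmonics."""
    terms = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * math.pi * k * t / period
        terms.extend([math.sin(angle), math.cos(angle)])
    return terms

def log_intensity(t, intercept, coefs, period=52):
    """Seasonal part of a log-intensity: intercept plus weighted harmonic terms."""
    n_harmonics = len(coefs) // 2
    basis = fourier_terms(t, period, n_harmonics)
    return intercept + sum(c * b for c, b in zip(coefs, basis))

# Hypothetical coefficients, chosen only for illustration (not estimates).
coefs = [0.8, -0.3, 0.1, 0.05]
weekly_intensity = [math.exp(log_intensity(t, 0.5, coefs)) for t in range(104)]
```

Because the terms enter on the log scale, the resulting intensity stays positive and repeats with the assumed yearly period.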

Keywords

Bayesian inference

Dengue fever

Fourier series

INGARCH models

Spatio-temporal modeling

Zero-inflated count data

View Abstract 1954

Speaker

Cathy Woan-Shu Chen, Feng Chia University

Co-Author(s)

Chun-Shu Chen, National Central University
Hsiao-Hsuan Liao, Department of Statistics, Feng Chia University

AI-Driven Sentiment Analysis and LDA-Based Topic Modeling with Automated Summarization for Airbnb Re

Online consumers provide rich, unstructured textual data to express their satisfaction and dissatisfaction. This has motivated researchers to investigate how to systematically process, analyze, and extract meaningful patterns hidden in large volumes of unstructured text. This study combines a Large Language Model (AI) with sentiment analysis and LDA-based topic modeling for online reviews to explore what customers truly care about and which factors contribute to positive feedback. This novel collaborative approach improves the interpretability and readability of the large volume of comments, no longer limiting them to explicit sentiment classification before sentiment analysis, and generates concise, readable summary sentences. Using customer reviews as a case study, the results reveal that 10 themes are prominently featured in positive reviews. These findings suggest that both accommodation comfort and positive host–guest interactions increase the likelihood of positive reviews.

Keywords

Sentiment Analysis

Large Language Model

Artificial Intelligence

Latent Dirichlet

Multinomial Logistic Regression

Topic Modeling

View Abstract 3711

Speaker

Bong-Jin Choi, North Dakota State University

Co-Author(s)

Gaoya Tu, NDSU
Jing Bai, North Dakota State University Main Campus

Bayesian infinite interactive fixed effects modeling for causal inference

Causal inference for single treatment effect estimation is challenging due to the absence of valid control units. The synthetic control method (SCM) offers an innovative way of constructing the so-called data-driven control unit. The generalized synthetic control (GSC) method is proposed as a factor model-based extension of SCM. While GSC improves upon SCM, the performance of GSC heavily depends on the choice of the number of latent factors. To account for the uncertainty associated with the number of factors, we propose to employ a Bayesian infinity factor modeling approach. The key idea of our Bayesian infinity factor modeling is to assign a cumulative shrinkage process prior on the factor loadings. In addition, we apply a Gaussian process approach to infer the non-linear treatment effect. The proposed Bayesian framework enables us to make full Bayesian inference about the time-varying treatment effect. The merits of the proposed Bayesian method are demonstrated through simulation studies and real data analysis.

Keywords

Bayesian infinity factor model

Causal inference

Interactive fixed effects model

Synthetic control method

View Abstract 2145

Speaker

Junha Seo, Kyungpook National University

Co-Author

Gyuhyeong Goh, Kyungpook National University

BRISK: Rank-Based Bayesian Feature Selection for High-Dimensional Data via Robust Permutation Kernel

High-dimensional biomedical data pose challenges such as multicollinearity, small sample sizes, and instability in variable selection. Feature selection is crucial for interpretability, reproducibility, and robust statistical learning. Traditional Bayesian methods, such as SSVS and spike-and-slab priors, often depend on precise distributional assumptions and are sensitive to prior choices, limiting their stability and transparency in prioritizing variables. We propose BRISK, a rank-based Bayesian feature selection framework utilizing robust permutation kernels, which unifies feature-specific evidence into stable rankings based on association strength and consistency across modeling scenarios. Unlike binary selection, BRISK generates a prioritized list, aiding validation and expert review. Empirical results on simulated and real omics data demonstrate BRISK's superior stability, reduced false discoveries, and improved predictive accuracy compared to traditional methods, especially when predictors are correlated or samples are limited. BRISK offers a reliable, interpretable approach for high-dimensional biomedical feature selection.

Keywords

Feature selection

High-dimensional data

Rank-based framework

Bayesian methods

Stability

Omics data application

View Abstract 2976

Speaker

Sakib Salam

Co-Author

Anjishnu Banerjee

BSTFA: An R Package for Efficient Bayesian Spatio-temporal Factor Analysis

Factor analysis methods are widely used for exploring latent characteristics of a random process. Spatio-temporal factor analysis extends this approach to account for spatial and temporal dependencies in spatio-temporal data. By modeling these dependencies, the estimated processes can be interpolated to unobserved locations. However, Bayesian spatio-temporal models are notoriously computationally burdensome, limiting their practical use. To address this, we developed the BSTFA package in R to automatically fit an efficient Bayesian spatio-temporal factor analysis model using dimension-reduced basis functions. The BSTFA package is user-friendly, computationally fast, and provides a powerful tool for modeling and interpreting spatio-temporal dependencies. We demonstrate its utility with a case study modeling PM 2.5 levels across the state of California for 25 years.

Keywords

Bayesian modeling

R Package

Spatio-temporal data

MCMC

Latent analysis

Basis functions

View Abstract 1995

Speaker

Adam Simpson

Co-Author

Candace Berrett, Brigham Young University

Density-based anomaly detection for functional data via archetypal analysis

Archetypal analysis (AA) is an interpretable, geometry-based unsupervised learning method that has been extended to functional data, where observations are represented as curves. In many industrial applications, such as manufacturing process monitoring, sensor measurements are inherently functional, making anomaly detection a critical task for monitoring and control. AA-based representations summarize functional observations using archetype coefficients, providing compact and interpretable features for functional anomaly detection. However, existing AA-based approaches often rely on heuristic and user-dependent criteria to identify anomalies, which can limit reproducibility and robustness. Motivated by characteristic distributional gaps observed in archetype coefficient spaces, we propose a density-based anomaly detection framework that identifies anomalies as observations located in low-density regions. Anomaly scores are defined using kernel density estimation, and decision thresholds are determined automatically by contamination-based quantiles. The proposed method is evaluated using simulation studies and an application to real semiconductor manufacturing process sensor data.
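The scoring and thresholding steps described above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it assumes a single one-dimensional archetype coefficient per observation, a Gaussian kernel with a fixed bandwidth, and hypothetical function names and data values.

```python
import math

def gaussian_kde_scores(points, bandwidth):
    """Kernel density estimate at each point, using all points (1-D Gaussian kernel)."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    scores = []
    for x in points:
        total = sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in points)
        scores.append(norm * total)
    return scores

def density_threshold(scores, contamination):
    """Density at the contamination quantile; the lowest-density fraction is flagged."""
    ordered = sorted(scores)
    n_flagged = max(1, int(round(contamination * len(scores))))
    return ordered[n_flagged - 1]

def detect_anomalies(points, bandwidth=0.1, contamination=0.1):
    """Flag observations whose estimated density is at or below the threshold."""
    scores = gaussian_kde_scores(points, bandwidth)
    cutoff = density_threshold(scores, contamination)
    return [s <= cutoff for s in scores]

# Hypothetical coefficients: nine clustered values and one isolated value.
coeffs = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 0.47, 0.53, 0.51, 0.95]
flags = detect_anomalies(coeffs, bandwidth=0.05, contamination=0.1)
```

The contamination level plays the role of the user-specified expected anomaly fraction, so the threshold follows from the data rather than from a hand-tuned cutoff.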

Keywords

Archetypal analysis

Functional data

Manufacturing process

Anomaly detection

Density estimation

Decision threshold

View Abstract 2199

Speaker

Hee Su Lee

Co-Author

Min Ho Cho

Entrepreneurs threat as the ruin event: the risk's negative perusal by the Erlang(n) model.

Entrepreneurial threats are extreme realizations of investment risk, materializing as ruin events in the surplus dynamics of entrepreneurial or investment activities. Across social, economic, or institutional Entrepreneurships Interests Focus (E.I.F.), with Acting Levels (E.A.L.) Micro, Meso, or Macro, and Dynamic Trends (E.D.T.) innovation, impact, or problem-solving, threats represent adverse risks that may cause irreversible failure. Entrepreneurial risk perceived negatively is represented by the surplus process R^π(t) = u + ct − ∑X_i − L^π(t), with ruin defined as R^π(t) < 0 for t ≥ 0, initial capital u ≥ 0, and premium income c > 0. Within a generalized Erlang(n) inter-arrival structure, the maximum severity of ruin provides an extreme indicator of loss, enabling explicit downside risk assessment. Crossing E.I.F., E.A.L., and E.D.T. via paired extended decision analysis into a dynamic n-dimensional decision space, nonparametric density estimation at each risk state allows real-time computation of losses through m-th moments of the failure dividend D_u, yielding lower bounds for optimal strategies within the Erlang(n) risk model.

Keywords

Entrepreneur Threats

Maximum severity of ruin

Controlled Surplus Process

Entrepreneurships Interests Focus (E.I.F.)

Entrepreneurships Acting Level (E.A.L.)

Entrepreneurships Dynamic Trend (E.D.T.)

View Abstract 2153

Speaker

Mfondoum Valery

Extending Multifidelity Models Using Normalizing Flows

A common situation in statistical computer experiments is when we have multiple models for the same phenomenon, where accuracy and computational cost vary across the different models. Kennedy and O'Hagan (2000) popularized a framework for this 'multi-fidelity problem' that relates the different models in a linear way using Gaussian processes, which was then extended to the nonlinear setting in Perdikaris et al. (2017). In this work, we further extend this framework by using normalizing flows. Normalizing flows are a method of statistical inference where we transform a simple 'base' distribution into a more complex distribution with a series of invertible and differentiable transformations (see e.g. Papamakarios et al. 2020). We show results from numerical studies showing that using normalizing flows for this problem performs well and is flexible.

Keywords

Multifidelity Modeling

Computer Experiments

Normalizing Flows

View Abstract 2626

Speaker

Lloyd Goldstein, University of Cincinnati

Co-Author

Emily Kang, University of Cincinnati

Interpretable anomaly detection framework using functional logistic regression

Sensor-driven anomaly detection is critical for yield and quality stability in semiconductor manufacturing. In real-world deployment, interpretability is increasingly necessary to determine which process variables drive anomalies and the timing of their occurrence. Multivariate process sensor time series are often noisy, incomplete, irregularly sampled, and high-dimensional, which complicates stable modeling and root-cause attribution. We propose an interpretable framework that smooths sensor traces with splines to form functional predictors and evaluates them on a common, feature-wise normalized time grid. Anomaly probabilities are then estimated using a variable-selecting functional logistic regression, which simultaneously identifies contributing sensors. To localize effects in time, we partition predictors and coefficient functions into predefined intervals to compute interval-wise contributions, yielding sensor-interval attributions. We demonstrate our method using simulations with known anomaly sources and apply it to the D2 wafer-trace dataset (multivariate process sensor time series from semiconductor manufacturing) to identify contributing sensors and time intervals.
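One way to read the interval-wise contributions is as the integral of a predictor times its coefficient function, split over predefined time intervals. The sketch below illustrates that reading for a single sensor on a shared grid; the trapezoidal rule, the function names, and the toy trace and coefficient values are assumptions for illustration, not the authors' procedure.

```python
def trapezoid(values, grid):
    """Trapezoidal integral of values sampled on grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def interval_contributions(x, beta, grid, breakpoints):
    """Integrate x(t) * beta(t) over each interval between consecutive breakpoints.

    breakpoints are indices into grid; neighbouring intervals share an endpoint,
    so the contributions add up to the integral over the whole grid.
    """
    product = [xi * bi for xi, bi in zip(x, beta)]
    contributions = []
    for start, stop in zip(breakpoints[:-1], breakpoints[1:]):
        contributions.append(trapezoid(product[start:stop + 1], grid[start:stop + 1]))
    return contributions

grid = [i / 10 for i in range(11)]   # common normalized time grid on [0, 1]
x = [1.0] * 11                       # hypothetical smoothed sensor trace
beta = [0.0] * 6 + [2.0] * 5         # hypothetical coefficient function, active late
parts = interval_contributions(x, beta, grid, breakpoints=[0, 5, 10])
total = trapezoid([xi * bi for xi, bi in zip(x, beta)], grid)
```

In a logistic model these pieces live on the log-odds scale, so comparing them across sensors and intervals indicates where the anomaly probability is being driven from.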

Keywords

Anomaly detection

Semiconductor manufacturing

Interpretability

Multivariate sensor time series

Functional logistic regression

Interval-wise contribution

View Abstract 2076

Speaker

Sang Jin Choi

Co-Author

Min Ho Cho

K–Anonymity–Aware Sequential Sampling Method for Synthetic Data

In sensitive domains like healthcare, synthetic data can replace releasing microdata, aiming to match key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates. We propose a k–anonymity–aware sequential sampling approach that generates each synthetic record by sampling variables sequentially from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables already sampled) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
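The core sampling step can be sketched as follows, under simplifying assumptions: all variables are categorical (continuous ones would first be binned into histogram cells), variables are sampled in a fixed column order, and the dependence-guided relaxation is omitted, so the sketch simply gives up when no candidate value meets the frequency requirement. Function names and the toy records are hypothetical.

```python
import random
from collections import Counter

def allowed_values(records, prefix, k):
    """Values of the next variable whose extended partial pattern has at least k matches."""
    position = len(prefix)
    matching = [r for r in records if tuple(r[:position]) == tuple(prefix)]
    counts = Counter(r[position] for r in matching)
    return {value: count for value, count in counts.items() if count >= k}

def sample_record(records, k, rng):
    """Sample one synthetic record, variable by variable, from empirical conditionals."""
    prefix = []
    for _ in range(len(records[0])):
        candidates = allowed_values(records, prefix, k)
        if not candidates:
            # The abstract relaxes the conditioning set here; this sketch does not.
            return None
        values = list(candidates)
        weights = [candidates[v] for v in values]
        prefix.append(rng.choices(values, weights=weights, k=1)[0])
    return tuple(prefix)

# Hypothetical original data with two categorical variables.
records = [("A", "x")] * 3 + [("A", "y")] + [("B", "x")] * 2 + [("C", "y")]
rng = random.Random(0)
synthetic = [sample_record(records, k=2, rng=rng) for _ in range(50)]
```

With k = 2, the patterns that occur only once in the toy data cannot be generated, which shows how k acts as a floor on the frequency of every emitted pattern.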

Keywords

Synthetic tabular data

Data Privacy

k-anonymity

Sequential sampling

Normalized mutual information

View Abstract 2529

Speaker

TaeWook Kim

Co-Author(s)

Jeongyoun Ahn, KAIST
Changwon Yoon, Department of Industrial & Systems Engineering, KAIST
Cheolwoo Park, KAIST
Bonwoo Lee

Marginalization with Moment Generating Functions with Applications in Astrostatistics

We present a new analytical method to derive a likelihood function that is marginalised over a population. This method can be used for computational advantage in the context of Bayesian hierarchical models and marginal likelihood calculations in Bayesian models. The key innovation is the specification of the necessary integrals in terms of high-order (sometimes fractional) derivatives of the population prior moment-generating function, if particular existence and differentiability conditions hold.
We confine our attention to Poisson and gamma likelihood functions. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.
We also present examples validating this new analytical method. In some of the examples, the new method is the only known analytical method to calculate the integral, giving instantaneous and accurate calculations.
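For the Poisson case, the idea can be illustrated with a standard identity: if M(t) = E[exp(tλ)] is the moment-generating function of the population prior, then the n-th derivative M^(n)(-1) equals E[λ^n exp(-λ)], so the marginal probability of an observed count n is M^(n)(-1) / n!. The sketch below checks this numerically for a gamma prior, which is chosen here only because its MGF derivatives have a closed form; the prior, the parameter values, and the function names are assumptions, and the fractional-derivative case for gamma likelihoods is not covered.

```python
import math

def gamma_mgf_derivative(n, t, shape, rate):
    """n-th derivative of the gamma(shape, rate) MGF M(t) = (1 - t/rate)**(-shape), t < rate."""
    # Rising factorial shape * (shape + 1) * ... * (shape + n - 1).
    rising = math.exp(math.lgamma(shape + n) - math.lgamma(shape))
    return rising * rate ** (-n) * (1.0 - t / rate) ** (-(shape + n))

def marginal_poisson(n, shape, rate):
    """Marginal probability of count n via the MGF: M^(n)(-1) / n!."""
    return gamma_mgf_derivative(n, -1.0, shape, rate) / math.factorial(n)

def marginal_poisson_numeric(n, shape, rate, upper=60.0, steps=20000):
    """Midpoint-rule check of the integral of Poisson(n | lam) * gamma(lam) over lam."""
    h = upper / steps
    log_const = shape * math.log(rate) - math.lgamma(shape) - math.lgamma(n + 1)
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += math.exp(log_const + (n + shape - 1.0) * math.log(lam) - (1.0 + rate) * lam)
    return total * h

analytic = marginal_poisson(3, shape=2.5, rate=1.5)
numeric = marginal_poisson_numeric(3, shape=2.5, rate=1.5)
```

The analytic value needs only one derivative evaluation, whereas the numerical check loops over thousands of grid points, which hints at the computational advantage the abstract describes.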

Keywords

model evidences

fractional derivatives

moment-generating function

integration

Bayesian modeling

View Abstract 2009

Speaker

Siyang Li

Co-Author(s)

David van Dyk, Imperial College London
Maximilian Autenrieth, University of Cambridge

Maximum likelihood estimation for the Dirichlet distribution

The Dirichlet distribution is a multivariate generalization of the Beta distribution, defining a family of unit sum-constrained probabilities or proportions in a multi-dimensional simplex. This distribution is usually the first choice in modeling compositional data and has been applied in various fields, including modeling microbiome data, text classification, and market share analysis. The existing literature suggests that the maximum likelihood estimator (MLE) is the most effective method for estimating Dirichlet parameters. However, a significant issue is that simply assuming the existence and uniqueness of the MLE for the Dirichlet model parameter without an analytic proof can lead to a meaningless interpretation of its bias and/or relative mean squared error. First, we address this problem by proving the existence and uniqueness of MLEs for the general Dirichlet distribution parameters. Our method relies on a particular representation of the digamma function, and our proof is much simpler than the one by Ronning (1989). In the course of our investigation, we have also proved a conjecture left open by Ronning (1989) for the computation of the MLE, thereby bringing

Keywords

Apery’s constant

Beta function

Digamma function

Euler’s constant

Log-likelihood function

Trigamma function

View Abstract 1947

Speaker

Sucharitha Dodamgodage

Multivariate Asymmetric Spatial Covariance with Confluent Hypergeometric Marginals

We introduce a multivariate asymmetric spatial covariance model that replaces Matérn marginals with confluent hypergeometric (CH) covariance functions, providing greater flexibility in smoothness and tail behavior. Environmental processes often exhibit spatial delays, especially under the influence of prevailing wind or water flows, resulting in asymmetric cross-covariances between variables. Our construction operates within the conditional framework of Cressie and Zammit-Mangion (2016), using interaction functions to encode asymmetric cross-variable dependence. We give sufficient conditions on a class of interaction functions under which the resulting CH-based multivariate covariance is positive definite. We evaluate performance through simulation with cokriging, comparing predictive accuracy against multivariate symmetric and asymmetric Matérn-based models and a univariate CH model. We also illustrate the approach with a temperature and pressure dataset, showing improved fit and spatial prediction relative to Matérn-based alternatives.

Keywords

Spatial statistics

Multivariate spatial statistics

Asymmetric cross-covariance

Confluent hypergeometric covariance

Long-range dependence

Conditional approach

View Abstract 3617

Speaker

Valerie Han, Iowa State University

Co-Author

Pulong Ma, Iowa State University

Sparse Bayesian Clustering for Bounded Data via a Multivariate Beta Mixture Model

We develop a Bayesian overfitted multivariate beta mixture model for clustering aggregated ecological data bounded between 0 and 1. Such data, common in social determinants of health (SDoH) research, pose challenges for standard clustering methods due to restrictive distributional assumptions and limited interpretability. The proposed model reparameterizes the multivariate beta distribution in terms of mean and concentration parameters, enabling direct interpretation of cluster-specific profiles while accommodating skewness inherent in the data. Integrated feature saliency operates on cluster means to induce sparsity by identifying variables that meaningfully drive clustering and shrinking uninformative features toward a shared mean. An overfitted mixture formulation supports data-driven inference on the number of clusters while preserving posterior uncertainty. We assess performance through simulation studies and apply the model to neighborhood-level SDoH data from the Agency for Healthcare Research and Quality, yielding interpretable ecological clusters. The framework generalizes to a broad class of bounded, aggregated multivariate data.

Keywords

Bayesian mixture model

multivariate beta distribution

sparse modeling

ecological data

feature saliency

View Abstract 3361

Speaker

Carmen Rodriguez Cabrera, Harvard University

Co-Author

Briana Stephenson, Harvard T.H. Chan School of Public Health

Statistical stress testing of the global sea level in the alternative climate scenarios

This study develops a statistical framework to forecast global sea-level change as a function of atmospheric carbon dioxide (CO₂) concentrations and global temperature and to conduct stress testing under alternative climate policy scenarios. Three scenarios are considered: a) an expected scenario reflecting current emission trends, b) a best-case scenario assuming compliance with Kyoto Protocol CO₂ reduction targets, and c) a worst-case scenario assuming CO₂ emissions increase at a rate opposite to the Kyoto targets.

The analysis employs a three-stage modeling approach based on Seasonal Autoregressive Integrated Moving Average models with exogenous variables (SARIMAX). In the first stage, CO₂ dynamics are modeled using a univariate SARIMAX specification. In the second stage, global temperature is modeled with lagged temperature and CO₂ as an exogenous predictor. In the final stage, sea level is modeled as a function of its own dynamics and lagged global temperature. The estimated models are used to generate sea-level projections under the three scenarios.

The results indicate significant sea-level rise under the expected scenario, stabilization under the best-case scenario

Keywords

stress testing

SARIMAX

global warming

predictive modeling

temperature, CO2

sea level

View Abstract 1848

Speaker

Ian Yankovsky

Co-Author

Eugene Yankovsky, The Clorox Company

Tensor Bayesian Copula Factor Models for High-Dimensional Mixed Time Series Data

We propose a tensor Bayesian copula factor autoregressive model with multivariate responses for analyzing mixed-type time series data with both main effects and interactions. The model is motivated by the need to study dynamic relationships between macroeconomic variables and stock market indices, leading naturally to tensor-valued posterior distributions. Dependence is captured through latent factors in both the multivariate response time series and high-dimensional mixed-type covariates within a quadratic time series regression framework coupled with copula functions. To enhance computational efficiency, we employ a semiparametric extended rank likelihood for the marginal distributions of the covariates, substantially reducing parameter dimensionality. Posterior inference is performed using Metropolis–Hastings and Forward Filtering Backward Sampling algorithms embedded in a Gibbs sampling scheme. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application to a real macroeconomic dataset.

Keywords

Multivariate tensor time series

Bayesian inference

Factor analysis

Copula models

View Abstract 2128

Speaker

Hadi Safari Katesari, Hostos Community College-CUNY

Co-Author(s)

Samira Zaroudi, John Jay College of Criminal Justice-CUNY
S. Yaser Samadi, Southern Illinois University-Carbondale

Transfer Topic Modeling for Identifying Depression Subtypes in Youth

Depression is increasing among adolescents and young adults. Identifying clinically meaningful subtypes of depression and comorbidity patterns in youth is therefore a critical priority. Topic modeling of electronic health records (EHR) offers a promising strategy to uncover latent depression phenotypes and patient subtypes underlying diagnostic co-occurrence. Yet two analytical barriers remain: existing methods often lack the efficiency needed for large EHR datasets with high-dimensional features, and the smaller, sparser records common in youth populations hinder estimation stability. To overcome these challenges, we propose a transfer topic modeling approach that integrates the computationally efficient Topic-SCORE algorithm with a ridge-type estimator to stabilize latent subspace estimation and improve topic matrix recovery. Simulation studies show our method outperforms models trained only on the target population and alternative transfer learning approaches. Leveraging the All of Us Research Program, our method identifies seven clinically meaningful latent structures of youth depression and distinguishes subgroups at elevated risk for suicidal thoughts and behaviors.

Keywords

Electronic health records

Representation learning

Transfer learning

Mental health

Suicide risk

View Abstract 1917

Speaker

Yu-Jyun Huang, Harvard University

Co-Author

Rui Duan, Harvard University