SPEED 1: Data Challenge, Bayesian Analysis, and Statistical Applications, Part 1

Jeremy Gaskins Chair
University of Louisville
 
Sunday, Aug 2: 2:00 PM - 3:50 PM
7103 
Contributed Speed 
Thomas M. Menino Convention & Exhibition Center 
Room: CC-102A 

Presentations

Exploring What Works in Education

This entry to the 2026 Data Expo Challenge will use the What Works Clearinghouse database to explore what works in education. Both static and interactive graphics will be used to communicate the results.

Speaker

Eleanor Hopkins

Co-Author(s)

Eleanor Hopkins
Angela Liang

Engaged or Absent: What Economic Shocks Reveal About the Wrong Interventions and the Right Ones

Despite widespread investment in attendance-targeted interventions, chronic absenteeism has doubled in U.S. public schools since 2018 and remains elevated. We argue the field has been treating a symptom rather than the underlying condition. Using the What Works Clearinghouse as an analytical dataset, we examine whether interventions designed for attendance outperform engagement-oriented programs that reduce absenteeism as a secondary effect. We then test whether this pattern holds under real-world conditions by studying how economic shocks — events that suddenly disrupt family stability and student engagement — differentially affect districts depending on which type of program they had in place. Our findings point toward a reorientation of both research and policy away from attendance monitoring and toward engagement as the more promising lever for keeping students in school. 

Keywords

The Annual Data Challenge Expo 

Speaker

Lesia Semenova, Rutgers University

From Intervention to Impact: Analyzing Educational Effectiveness in U.S.

Evidence-based education policy increasingly relies on systematic reviews to inform the adoption and scaling of instructional programs. We examine heterogeneity in educational intervention effectiveness using the What Works Clearinghouse dataset maintained by the U.S. Department of Education. To address the question of "what works, for whom, and under what conditions," we analyze intervention–protocol combinations across outcome domains and student subgroups. In contrast to prior work that emphasizes average intervention effects, our analysis focuses on subgroup-specific effectiveness across populations, geographic regions, educational contexts, and implementation conditions. We leverage hierarchical models to account for findings nested within studies and interventions, and integrate external data from the National Center for Education Statistics and the U.S. Census Bureau to incorporate broader educational and demographic context. By comparing overall and subgroup-specific results, we identify interventions with broad effectiveness as well as those whose impacts are concentrated in particular populations or settings. 

Keywords

Subgroup Analysis

Hierarchical Modeling

Causal Inference 

Speaker

Mohammed Rahman

Co-Author(s)

Aditi Sen
Ayoushman Bhattacharya, Washington University in St. Louis

AI-Driven Sentiment Analysis and LDA-Based Topic Modeling with Automated Summarization

Online consumers express their satisfaction and dissatisfaction in rich, unstructured text. This has motivated researchers to investigate how to systematically process, analyze, and extract meaningful patterns hidden in large volumes of unstructured text. This study combines a fine-tuned local large language model (AI) with sentiment analysis and LDA-based topic modeling of online reviews to identify what customers truly care about and which factors contribute to positive feedback. This novel collaborative approach improves the interpretability and readability of the large volume of comments, no longer limiting them to explicit sentiment classification before sentiment analysis, and generates concise, readable summary sentences. Using customer reviews as a case study, the results reveal 10 themes that are prominently featured in positive reviews. These findings suggest that both accommodation comfort and positive host–guest interactions increase the likelihood of positive reviews.

Keywords

Sentiment Analysis

Large Language Model, Fine Tuned AI, Local LLM

Artificial Intelligence

Latent Dirichlet

Multinomial Logistic Regression

Topic Modeling 

Speaker

Bong-Jin Choi, North Dakota State University

Co-Author(s)

Gaoya Tu, NDSU
Jing Bai, North Dakota State University Main Campus

Bayesian infinite interactive fixed effects modeling for causal inference

Causal inference for single treatment effect estimation is challenging due to the absence of valid control units. The synthetic control method (SCM) offers an innovative way of constructing the so-called data-driven control unit. The generalized synthetic control (GSC) method is proposed as a factor model-based extension of SCM. While GSC improves upon SCM, the performance of GSC heavily depends on the choice of the number of latent factors. To account for the uncertainty associated with the number of factors, we propose to employ a Bayesian infinity factor modeling approach. The key idea of our Bayesian infinity factor modeling is to assign a cumulative shrinkage process prior on the factor loadings. In addition, we apply a Gaussian process approach to infer the non-linear treatment effect. The proposed Bayesian framework enables us to make full Bayesian inference about the time-varying treatment effect. The merits of the proposed Bayesian method are demonstrated through simulation studies and real data analysis. 
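The abstract does not spell out the prior, so the following is only a minimal sketch of one common stick-breaking construction of a cumulative shrinkage process. The probability that column h of the loading matrix falls into a near-zero "spike" grows with h, so later factors are increasingly switched off. The function names, the inverse-gamma slab, and all hyperparameter values are illustrative assumptions, not the authors' specification.

```python
import numpy as np

def cusp_spike_probabilities(num_columns, alpha, rng):
    """Cumulative stick-breaking probabilities pi_1 <= pi_2 <= ... <= 1."""
    sticks = rng.beta(1.0, alpha, size=num_columns)  # v_l ~ Beta(1, alpha)
    # Length of stick remaining before each break: prod_{m<l} (1 - v_m).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    weights = sticks * remaining                     # omega_l
    return np.cumsum(weights)                        # pi_h = sum_{l<=h} omega_l

def draw_column_scales(spike_probs, spike_value, slab_shape, slab_rate, rng):
    """Draw one variance per column: spike with prob. pi_h, else slab."""
    in_spike = rng.random(spike_probs.shape) < spike_probs
    # Assumed slab: inverse-gamma(slab_shape, slab_rate).
    slab_draws = 1.0 / rng.gamma(slab_shape, 1.0 / slab_rate,
                                 size=spike_probs.shape)
    return np.where(in_spike, spike_value, slab_draws)

rng = np.random.default_rng(0)
spike_probs = cusp_spike_probabilities(num_columns=15, alpha=5.0, rng=rng)
column_scales = draw_column_scales(spike_probs, 0.05, 2.0, 2.0, rng)

# Prior draw of a 20 x 15 loading matrix; later columns tend to be shrunk.
loadings = rng.normal(size=(20, 15)) * np.sqrt(column_scales)
```

Because the spike probabilities are non-decreasing in the column index, the effective number of factors is learned from the data rather than fixed in advance, which is the motivation the abstract gives for this prior.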

Keywords

Bayesian infinity factor model

Causal inference

Interactive fixed effects model

Synthetic control method 

Speaker

Junha Seo, Kyungpook National University

Co-Author

Gyuhyeong Goh, Kyungpook National University

BRISK: Rank-Based Bayesian Feature Selection for High-Dimensional Data via Robust Permutation Kernel

High-dimensional biomedical data pose challenges such as multicollinearity, small sample sizes, and instability in variable selection. Feature selection is crucial for interpretability, reproducibility, and robust statistical learning. Traditional Bayesian methods, such as SSVS and spike-and-slab priors, often depend on precise distributional assumptions and are sensitive to prior choices, limiting their stability and transparency in prioritizing variables. We propose BRISK, a rank-based Bayesian feature selection framework utilizing robust permutation kernels, which unifies feature-specific evidence into stable rankings based on association strength and consistency across modeling scenarios. Unlike binary selection, BRISK generates a prioritized list, aiding validation and expert review. Empirical results on simulated and real omics data demonstrate BRISK's superior stability, reduced false discoveries, and improved predictive accuracy compared to traditional methods, especially when predictors are correlated or samples are limited. BRISK offers a reliable, interpretable approach for high-dimensional biomedical feature selection. 

Keywords

Feature selection

High-dimensional data

Rank-based framework

Bayesian methods

Stability

Omics data application 

Speaker

Sakib Salam

Co-Author

Anjishnu Banerjee

BSTFA: An R Package for Efficient Bayesian Spatio-temporal Factor Analysis

Factor analysis methods are widely used for exploring latent characteristics of a random process. Spatio-temporal factor analysis extends this approach to account for spatial and temporal dependencies in spatio-temporal data. By modeling these dependencies, the estimated processes can be interpolated to unobserved locations. However, Bayesian spatio-temporal models are notoriously computationally burdensome, limiting their practical use. To address this, we developed the BSTFA package in R to automatically fit an efficient Bayesian spatio-temporal factor analysis model using dimension-reduced basis functions. The BSTFA package is user-friendly, computationally fast, and provides a powerful tool for modeling and interpreting spatio-temporal dependencies. We demonstrate its utility with a case study modeling PM2.5 levels across the state of California over 25 years.

Keywords

Bayesian modeling

R Package

Spatio-temporal data

MCMC

Latent analysis

Basis functions 

Speaker

Adam Simpson

Co-Author

Candace Berrett, Brigham Young University

Density-based anomaly detection for functional data via archetypal analysis

Archetypal analysis (AA) is an interpretable, geometry-based unsupervised learning method that has been extended to functional data, where observations are represented as curves. In many industrial applications, such as manufacturing process monitoring, sensor measurements are inherently functional, making anomaly detection a critical task for monitoring and control. AA-based representations summarize functional observations using archetype coefficients, providing compact and interpretable features for functional anomaly detection. However, existing AA-based approaches often rely on heuristic and user-dependent criteria to identify anomalies, which can limit reproducibility and robustness. Motivated by characteristic distributional gaps observed in archetype coefficient spaces, we propose a density-based anomaly detection framework that identifies anomalies as observations located in low-density regions. Anomaly scores are defined using kernel density estimation, and decision thresholds are determined automatically by contamination-based quantiles. The proposed method is evaluated using simulation studies and an application to real semiconductor manufacturing process sensor data. 
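The scoring and thresholding step described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the Gaussian kernel, the fixed bandwidth, the simulated Dirichlet "archetype coefficients", and the function names are all assumptions made for the example.

```python
import numpy as np

def gaussian_kde_log_density(points, bandwidth):
    """Log of a Gaussian KDE evaluated at each point of `points` itself.

    Each point's own kernel is included (no leave-one-out correction).
    """
    n, d = points.shape
    diffs = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1) / bandwidth ** 2
    log_kernel = (-0.5 * sq_dist
                  - 0.5 * d * np.log(2 * np.pi)
                  - d * np.log(bandwidth))
    # Numerically stable log-mean-exp over the kernel centres.
    row_max = log_kernel.max(axis=1, keepdims=True)
    log_mean = row_max + np.log(
        np.mean(np.exp(log_kernel - row_max), axis=1, keepdims=True))
    return log_mean.ravel()

def flag_anomalies(coefficients, contamination=0.05, bandwidth=0.1):
    """Score = negative log density; threshold = (1 - contamination) quantile."""
    scores = -gaussian_kde_log_density(coefficients, bandwidth)
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores, threshold, scores > threshold

# Hypothetical archetype coefficients on the 3-simplex: most curves sit near
# the first archetype, a handful sit near the third.
rng = np.random.default_rng(1)
typical = rng.dirichlet([8.0, 1.0, 1.0], size=95)
unusual = rng.dirichlet([1.0, 1.0, 8.0], size=5)
coefficients = np.vstack([typical, unusual])

scores, threshold, flags = flag_anomalies(coefficients, contamination=0.05)
print(f"threshold = {threshold:.3f}, flagged = {int(flags.sum())} of {len(flags)}")
```

The contamination level is the only quantity the user sets here; the threshold itself follows from the score distribution, which is the sense in which the decision rule is automatic.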

Keywords

Archetypal analysis

Functional data

Manufacturing process

Anomaly detection

Density estimation

Decision threshold 

Speaker

Hee Su Lee

Co-Author

Min Ho Cho

Entrepreneurs threat as the ruin event: the risk's negative perusal by the Erlang(n) model.

Entrepreneurial threats are extreme realizations of investment risk, materializing as ruin events in the surplus dynamics of entrepreneurial or investment activities. Across social, economic, or institutional Entrepreneurships Interests Focus (E.I.F.), with Acting Levels (E.A.L.) Micro, Meso, or Macro, and Dynamic Trends (E.D.T.) innovation, impact, or problem-solving, threats represent adverse risks that may cause irreversible failure. Entrepreneurial risk perceived negatively is represented by the surplus process $R^{\pi}(t) = u + ct - \sum_i X_i - L^{\pi}(t)$, with ruin defined as $R^{\pi}(t) < 0$ for $t \ge 0$, initial capital $u \ge 0$, and premium income $c > 0$. Within a generalized Erlang(n) inter-arrival structure, the maximum severity of ruin provides an extreme indicator of loss, enabling explicit downside risk assessment. Crossing E.I.F., E.A.L., and E.D.T. via paired extended decision analysis into a dynamic n-dimensional decision space, nonparametric density estimation at each risk state allows real-time computation of losses through the m-th moments of the failure dividend $D_u$, yielding lower bounds for optimal strategies within the Erlang(n) risk model.
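As a rough illustration of the risk model in this abstract, the sketch below simulates an uncontrolled surplus process with Erlang(n) inter-arrival times and estimates the ruin probability and the maximum severity of ruin over the first excursion below zero. Several simplifications are assumptions made here and are not taken from the abstract: the dividend (control) term is dropped, all n exponential stages share one rate, claim sizes are exponential, and the horizon is finite.

```python
import numpy as np

def simulate_path(u, c, n, rate, claim_mean, horizon, rng):
    """Return (ruined, max_severity) for one surplus path up to `horizon`."""
    t, surplus = 0.0, u
    ruined, worst_deficit = False, 0.0
    while True:
        wait = rng.gamma(shape=n, scale=1.0 / rate)  # Erlang(n) waiting time
        t += wait
        if t > horizon:
            break
        surplus += c * wait                          # premiums since last claim
        if ruined and surplus >= 0.0:
            break                                    # first excursion below 0 ended
        surplus -= rng.exponential(claim_mean)       # claim arrives
        if surplus < 0.0:
            ruined = True
            worst_deficit = max(worst_deficit, -surplus)
    return ruined, worst_deficit

def estimate_ruin(u, c, n, rate, claim_mean, horizon, num_paths, seed=0):
    """Monte Carlo ruin probability and mean maximum severity given ruin."""
    rng = np.random.default_rng(seed)
    results = [simulate_path(u, c, n, rate, claim_mean, horizon, rng)
               for _ in range(num_paths)]
    ruined = np.array([r for r, _ in results])
    severities = np.array([s for r, s in results if r])
    mean_severity = float(severities.mean()) if severities.size else 0.0
    return float(ruined.mean()), mean_severity

# Illustrative parameters: mean waiting time n / rate = 1, mean claim = 1,
# premium rate 1.2, so premiums exceed expected claims by 20 percent.
ruin_prob, mean_severity = estimate_ruin(
    u=5.0, c=1.2, n=2, rate=2.0, claim_mean=1.0, horizon=100.0, num_paths=2000)
print(f"ruin probability ~ {ruin_prob:.3f}, mean max severity ~ {mean_severity:.3f}")
```

Because the surplus only drops at claim instants, checking for ruin and tracking the deepest deficit immediately after each claim is sufficient; excursions cut off by the horizon are truncated, so the severity estimate is a lower approximation.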

Keywords

Entrepreneur Threats

Maximum severity of ruin

Controlled Surplus Process

Entrepreneurships Interests Focus (E.I.F.)

Entrepreneurships Acting Level (E.A.L.)

Entrepreneurships Dynamic Trend (E.D.T.) 

Speaker

Mfondoum Valery

Extending Multifidelity Models Using Normalizing Flows

A common situation in statistical computer experiments is when we have multiple models for the same phenomenon, where accuracy and computational cost vary across the different models. Kennedy and O'Hagan (2000) popularized a framework for this 'multi-fidelity problem' that relates the different models in a linear way using Gaussian processes, which was then extended to the nonlinear setting in Perdikaris et al. (2017). In this work, we further extend this framework by using normalizing flows. Normalizing flows are a method of statistical inference where we transform a simple 'base' distribution into a more complex distribution with a series of invertible and differentiable transformations (see e.g. Papamakarios et al. 2020). We present numerical studies showing that normalizing flows perform well on this problem and are flexible.
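For orientation, the two frameworks cited above and the change-of-variables identity behind normalizing flows can be written out for the two-fidelity case. The symbols (f_L and f_H for the low- and high-fidelity models, rho, delta, g, T) are notation chosen here for illustration; how the authors combine the flow with the multi-fidelity structure is not stated in the abstract.

```latex
% Linear autoregressive link (Kennedy and O'Hagan, 2000), with
% f_L and the discrepancy \delta given independent Gaussian process priors:
f_H(x) = \rho\, f_L(x) + \delta(x)

% Nonlinear link (Perdikaris et al., 2017), with a Gaussian process
% prior placed on the unknown map g:
f_H(x) = g\bigl(x,\, f_L(x)\bigr)

% Normalizing flow: if z \sim p_Z and x = T(z) with T invertible and
% differentiable, the density of x follows from the change of variables
p_X(x) = p_Z\bigl(T^{-1}(x)\bigr)\,\bigl|\det J_{T^{-1}}(x)\bigr|
```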

Keywords

Multifidelity Modeling

Computer Experiments

Normalizing Flows 

Speaker

Lloyd Goldstein, University of Cincinnati

Co-Author

Emily Kang, University of Cincinnati

Interpretable anomaly detection framework using functional logistic regression

Sensor-driven anomaly detection is critical for yield and quality stability in semiconductor manufacturing. In real-world deployment, interpretability is increasingly necessary to determine which process variables drive anomalies and the timing of their occurrence. Multivariate process sensor time series are often noisy, incomplete, irregularly sampled, and high-dimensional, which complicates stable modeling and root-cause attribution. We propose an interpretable framework that smooths sensor traces with splines to form functional predictors and evaluates them on a common, feature-wise normalized time grid. Anomaly probabilities are then estimated using a variable-selecting functional logistic regression, which simultaneously identifies contributing sensors. To localize effects in time, we partition predictors and coefficient functions into predefined intervals to compute interval-wise contributions, yielding sensor-interval attributions. We demonstrate our method using simulations with known anomaly sources and apply it to the D2 wafer-trace dataset (multivariate process sensor time series from semiconductor manufacturing) to identify contributing sensors and time intervals.

Keywords

Anomaly detection

Semiconductor manufacturing

Interpretability

Multivariate sensor time series

Functional logistic regression

Interval-wise contribution 

Speaker

Sang Jin Choi

Co-Author

Min Ho Cho

K–Anonymity–Aware Sequential Sampling Method for Synthetic Data

In sensitive domains like healthcare, synthetic data can replace releasing microdata, aiming to match key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates. We propose a k–anonymity–aware sequential sampling approach that generates each synthetic record by sampling variables sequentially from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables already sampled) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release. 
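The sampling loop described above can be sketched roughly as follows for integer-coded categorical data. This is one possible reading of the abstract, not the authors' code: the fixed variable order, the rule of dropping the weakest conditioning variable one at a time, the NMI normalisation by the geometric mean of the entropies, and the toy data are all assumptions.

```python
import numpy as np

def normalized_mutual_information(x, y):
    """NMI of two integer-coded variables, normalised by sqrt(H(x) * H(y))."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

def sample_record(data, k, nmi, rng):
    """Sample one synthetic record variable by variable."""
    n, p = data.shape
    record = np.empty(p, dtype=int)
    for j in range(p):
        # Conditioning set: already-sampled variables, strongest dependence first.
        cond = sorted(range(j), key=lambda i: -nmi[i, j])
        while True:
            mask = np.ones(n, dtype=bool)
            for i in cond:
                mask &= data[:, i] == record[i]
            values, counts = np.unique(data[mask, j], return_counts=True)
            keep = counts >= k          # pattern must occur at least k times
            if keep.any():
                break
            if not cond:
                raise ValueError(f"no value of variable {j} occurs {k} times")
            cond.pop()                  # relax: drop the weakest-related variable
        probs = counts[keep] / counts[keep].sum()
        record[j] = rng.choice(values[keep], p=probs)
    return record

# Toy original data: four coded variables, one of them with a rare level.
rng = np.random.default_rng(0)
age = rng.integers(0, 4, size=400)
region = rng.integers(0, 3, size=400)
diagnosis = (age + rng.integers(0, 2, size=400)) % 4
rare_flag = (rng.random(400) < 0.02).astype(int)
data = np.column_stack([age, region, diagnosis, rare_flag])

p = data.shape[1]
nmi = np.array([[normalized_mutual_information(data[:, a], data[:, b])
                 for b in range(p)] for a in range(p)])
synthetic = np.array([sample_record(data, k=5, nmi=nmi, rng=rng)
                      for _ in range(50)])
```

In this sketch the minimum-frequency check is applied to the pattern over the retained conditioning variables plus the candidate value, so after a relaxation the guarantee holds for that reduced pattern rather than for the full partial record.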

Keywords

Synthetic tabular data

Data Privacy

k-anonymity

Sequential sampling

Normalized mutual information 

Speaker

TaeWook Kim

Co-Author(s)

Jeongyoun Ahn, KAIST
Changwon Yoon, Department of Industrial & Systems Engineering, KAIST
Cheolwoo Park, KAIST
Bonwoo Lee

Marginalization with Moment Generating Functions with Applications in Astrostatistics

We present a new analytical method to derive a likelihood function that is marginalised over a population. This method can be used for computational advantage in the context of Bayesian hierarchical models and marginal likelihood calculations in Bayesian models. The key innovation is the specification of the necessary integrals in terms of high-order (sometimes fractional) derivatives of the population prior moment-generating function, if particular existence and differentiability conditions hold.
We confine our attention to Poisson and gamma likelihood functions. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.
We also present examples validating this new analytical method. In some of the examples, the new method is the only known analytical method to calculate the integral, giving instantaneous and accurate calculations. 
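The Poisson case can be illustrated with a small numerical check. If M(t) = E[exp(t*lambda)] is the moment-generating function of the population prior, then the n-th derivative is M^(n)(t) = E[lambda^n exp(t*lambda)], so the marginal probability of an observed count n is M^(n)(-1) / n!. The sketch below assumes a gamma prior, chosen here only because its MGF has closed-form derivatives, and compares the derivative formula with brute-force quadrature; the function names and parameter values are illustrative.

```python
import math
import numpy as np

def poisson_marginal_via_mgf(n, alpha, beta):
    """M^(n)(-1) / n! for a Gamma(shape=alpha, rate=beta) prior on lambda.

    M(t) = (1 - t/beta)^(-alpha), so
    M^(n)(t) = Gamma(alpha + n) / Gamma(alpha) * beta^(-n) * (1 - t/beta)^(-(alpha + n)).
    """
    log_derivative = (math.lgamma(alpha + n) - math.lgamma(alpha)
                      - n * math.log(beta)
                      - (alpha + n) * math.log1p(1.0 / beta))
    return math.exp(log_derivative - math.lgamma(n + 1))

def poisson_marginal_numeric(n, alpha, beta, upper=60.0, grid_size=200_001):
    """Trapezoid-rule integral of Poisson(n | lambda) * Gamma(lambda)."""
    lam = np.linspace(1e-12, upper, grid_size)
    log_integrand = (n * np.log(lam) - lam - math.lgamma(n + 1)
                     + alpha * math.log(beta) - math.lgamma(alpha)
                     + (alpha - 1.0) * np.log(lam) - beta * lam)
    integrand = np.exp(log_integrand)
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(lam)) / 2.0)

alpha, beta = 2.5, 1.5
for n in (0, 3, 10):
    analytic = poisson_marginal_via_mgf(n, alpha, beta)
    numeric = poisson_marginal_numeric(n, alpha, beta)
    print(f"n = {n:2d}: MGF derivative {analytic:.8f}, quadrature {numeric:.8f}")
```

With a gamma prior the result is the familiar negative binomial probability, which makes this a convenient sanity check; the gamma-likelihood case with fractional derivatives is not covered by this sketch.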

Keywords

model evidences

fractional derivatives

moment-generating function

integration

Bayesian modeling 

Speaker

Siyang Li

Co-Author(s)

David van Dyk, Imperial College London
Maximilian Autenrieth, University of Cambridge

Maximum likelihood estimation for the Dirichlet distribution

The Dirichlet distribution is a multivariate generalization of the Beta distribution, defining a family of unit sum-constrained probabilities or proportions in a multi-dimensional simplex. This distribution is usually the first choice in modeling compositional data and has been applied in various fields, including modeling microbiome data, text classification, and market share analysis. The existing literature suggests that the maximum likelihood estimator (MLE) is the most effective method for estimating Dirichlet parameters. However, a significant issue is that simply assuming the existence and uniqueness of the MLE for the Dirichlet model parameter without an analytic proof can lead to a meaningless interpretation of its bias and/or relative mean squared error. First, we address this problem by proving the existence and uniqueness of MLEs for the general Dirichlet distribution parameters. Our method relies on a particular representation of the digamma function, and our proof is much simpler than the one by Ronning (1989). In the course of our investigation, we have also proved a conjecture left open by Ronning (1989) for the computation of the MLE, thereby bringing

Keywords

Apery’s constant

Beta function

Digamma function

Euler’s constant

Log-likelihood function

Trigamma function 

Speaker

Sucharitha Dodamgodage

A Bivariate Asymmetric Spatial Covariance Model with Consistent Marginals

We develop a bivariate asymmetric spatial covariance model using the conditional approach of Cressie and Zammit-Mangion (2016). Our construction includes both a simple pointwise dependence interaction and an asymmetric dependence interaction, while ensuring that both marginal covariance functions are from the same family such as the Matérn class. Unlike the conditional formulation in Cressie and Zammit-Mangion (2016), in which the same covariance family is imposed on only one marginal and a conditional component, our approach preserves the same marginals for both observed variables, aiding interpretation, comparison with familiar univariate spatial models, and model specification in practice. We derive sufficient conditions under which the resulting bivariate covariance function is positive definite. Through simulation studies with cokriging, we compare the proposed models with existing symmetric and asymmetric Matérn-based alternatives and assess predictive performance. We further illustrate the superior performance of the proposed models with a spatial temperature-pressure dataset.  

Keywords

Geostatistics

Asymmetry

Matérn covariance

Cross-covariance

Conditional approach 

Speaker

Valerie Han, Iowa State University

Co-Author

Pulong Ma, Iowa State University

Salient Bayesian Clustering for Proportional Data via a Multivariate Beta Mixture Model

We develop a Bayesian overfitted multivariate beta mixture model for clustering aggregated ecological data bounded between 0 and 1. Such data, common in social determinants of health (SDoH) research, pose challenges for standard clustering methods due to restrictive distributional assumptions and limited interpretability. The proposed model reparameterizes the multivariate beta distribution in terms of mean and concentration parameters, enabling direct interpretation of cluster-specific profiles while accommodating skewness inherent in the data. Integrated feature saliency operates on cluster means to induce sparsity by identifying variables that meaningfully drive clustering and shrinking uninformative features toward a shared mean. An overfitted mixture formulation supports data-driven inference on the number of clusters while preserving posterior uncertainty. We assess performance through simulation studies and apply the model to neighborhood-level SDoH data from the Agency for Healthcare Research and Quality, yielding interpretable ecological clusters. The framework generalizes to a broad class of bounded, aggregated multivariate data. 

Keywords

Bayesian mixture model


multivariate beta distribution

sparse modeling

ecological data

feature saliency 

Speaker

Carmen Rodriguez Cabrera, Harvard University

Co-Author

Briana Stephenson, Harvard T.H. Chan School of Public Health

Statistical stress testing of the global sea level in the alternative climate scenarios

This study develops a statistical framework to forecast global sea-level change as a function of atmospheric carbon dioxide (CO₂) concentrations and global temperature and to conduct stress testing under alternative climate policy scenarios. Three scenarios were considered: a) an expected scenario reflecting current emission trends, b) a best-case scenario assuming compliance with Kyoto Protocol CO₂ reduction targets, and c) a worst-case scenario assuming CO₂ emissions increase at a rate opposite to the Kyoto Protocol targets.

The analysis employs a three-stage modeling approach based on Seasonal Autoregressive Integrated Moving Average models with exogenous variables (SARIMAX). In the first stage, CO₂ dynamics are modeled using a univariate SARIMAX specification. In the second stage, global temperature is modeled with lagged temperature and CO₂ as an exogenous predictor. In the final stage, the sea level is modeled as a function of its own dynamics and lagged global temperature.

Results indicate continued sea-level rise under the expected scenario, partial stabilization under the best-case scenario, and accelerated increases under the worst-case scenario. The proposed methodology demonstrates how classical time-series methods can be used for climate stress testing and policy analysis.
 

Keywords

stress testing

SARIMAX

global warming

predictive modeling

temperature, CO2

sea level 

Speaker

Ian (Yan) Yankovsky, Fletcher Middle School

Co-Author

Eugene Yankovsky, ProfeSci Inc

Tensor Bayesian Copula Factor Models for High-Dimensional Mixed Time Series Data

We propose a tensor Bayesian copula factor autoregressive model with multivariate responses for analyzing mixed-type time series data with both main effects and interactions. The model is motivated by the need to study dynamic relationships between macroeconomic variables and stock market indices, leading naturally to tensor-valued posterior distributions. Dependence is captured through latent factors in both the multivariate response time series and high-dimensional mixed-type covariates within a quadratic time series regression framework coupled with copula functions. To enhance computational efficiency, we employ a semiparametric extended rank likelihood for the marginal distributions of the covariates, substantially reducing parameter dimensionality. Posterior inference is performed using Metropolis–Hastings and Forward Filtering Backward Sampling algorithms embedded in a Gibbs sampling scheme. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application to a real macroeconomic dataset. 

Keywords

Multivariate tensor time series

Bayesian inference

Factor analysis

Copula models 

Speaker

Hadi Safari Katesari

Co-Author(s)

S. Yaser Samadi, Southern Illinois University-Carbondale
Yisu Hou, Quantitative Methods in the Social Sciences, Columbia University
Samira Zaroudi, John Jay College of Criminal Justice-CUNY

Transfer Topic Modeling for Identifying Depression Subtypes in Youth

Depression is increasing among adolescents and young adults. Identifying clinically meaningful subtypes of depression and comorbidity patterns in youth is therefore a critical priority. Topic modeling of electronic health records (EHR) offers a promising strategy to uncover latent depression phenotypes and patient subtypes underlying diagnostic co-occurrence. Yet two analytical barriers remain: existing methods often lack the efficiency needed for large EHR datasets with high-dimensional features, and the smaller, sparser records common in youth populations hinder estimation stability. To overcome these challenges, we propose a transfer topic modeling approach that integrates the computationally efficient Topic-SCORE algorithm with a ridge-type estimator to stabilize latent subspace estimation and improve topic matrix recovery. Simulation studies show our method outperforms models trained only on the target population and alternative transfer learning approaches. Leveraging the All of Us Research Program, our method identifies seven clinically meaningful latent structures of youth depression and distinguishes subgroups at elevated risk for suicidal thoughts and behaviors. 

Keywords

Electronic health records

Representation learning

Transfer learning

Mental health

Suicide risk 

Speaker

Yu-Jyun Huang, Harvard University

Co-Author

Rui Duan, Harvard University