Sunday, Aug 2: 2:00 PM - 3:50 PM
Contributed Speed
Presentations
TBD
The What Works Clearinghouse (WWC) database aggregates findings from thousands of education studies and includes structured indicators of evidence strength. We will develop an interpretable workflow to identify recurring patterns in reported findings across interventions, outcomes, grade levels, and available contexts, while quantifying heterogeneity. Our analysis will incorporate evidence tiers and standards ratings to assess robustness and will use simple interpretable summaries supported by diagnostic visualizations.
Keywords
The Annual Data Challenge Expo
Evidence-based education policy increasingly depends on systematic reviews to guide the adoption and scaling of instructional programs. Our goal is to provide a more nuanced understanding of "what works, for whom, and under what conditions" in education. In this research, we analyze the What Works Clearinghouse dataset, maintained by the U.S. Department of Education, to examine heterogeneity in educational intervention effectiveness.
The WWC dataset provides a comprehensive synthesis of educational research in a multi-level structure, linking findings, studies, and intervention reports. We investigate "what works", i.e., which version of an intervention works, or in other words, which intervention is effective under what protocol. We address the "for whom" and "under what conditions" components by leveraging information on region, school type, grades, race, sex, etc. To understand whether such an intervention-protocol combination "works", we examine variables that measure the impact of an intervention on specific outcome domains. In contrast to most applications of WWC data, which emphasize average intervention effectiveness, our goal in this study is subgroup analysis.
This study develops an empirical Bayes (EB) spatial hurdle INGARCH model for weekly dengue fever counts and compares it with a spatial zero-inflated generalized Poisson (ZIGP) INGARCH framework, both capturing spatio-temporal dependence and excess zeros. The EB-hurdle model introduces a data-adaptive prior that reduces model complexity and enhances parsimony and stability while retaining flexibility for dynamic zero inflation. In spatial INGARCH models, the zero-generating mechanism depends on lagged outcomes, implying that covariate effects influence the intensity equation indirectly. To enhance epidemiological relevance, seasonal patterns are incorporated into the log-intensity equations using Fourier-based harmonic terms and meteorological covariates. Model inference is conducted within a Bayesian framework. The results highlight the distinct roles of seasonal and environmental drivers in dengue transmission and demonstrate that Fourier-based periodic components provide an effective alternative when meteorological data are limited or unavailable. Overall, the empirical Bayes approach offers a parsimonious and stable improvement over conventional hurdle INGARCH models.
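As a point of reference for the seasonal specification (illustrative notation only; the hurdle/zero-inflation structure and empirical Bayes prior are not reproduced here), the harmonic terms enter a log-intensity equation of the form

\[
\log \lambda_{i,t} \;=\; \eta_{i,t} \;+\; \sum_{k=1}^{K}\left[ a_k \sin\!\left(\frac{2\pi k t}{52}\right) + b_k \cos\!\left(\frac{2\pi k t}{52}\right) \right] \;+\; \mathbf{x}_{i,t}^{\top}\boldsymbol{\delta},
\]

where \(\lambda_{i,t}\) is the conditional mean of the weekly count in location i, \(\eta_{i,t}\) collects the autoregressive (INGARCH) and spatial-lag terms, the K harmonics with period 52 capture annual seasonality in weekly data, and \(\mathbf{x}_{i,t}\) holds meteorological covariates when available.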
Keywords
Bayesian inference
Dengue fever
Fourier series
INGARCH models
Spatio-temporal modeling
Zero-inflated count data
Online consumers provide rich, unstructured textual data to express their satisfaction and dissatisfaction. This has motivated researchers to investigate how to systematically process, analyze, and extract meaningful patterns hidden in large volumes of unstructured text. This study combines a large language model (LLM) with sentiment analysis and LDA-based topic modeling of online reviews to explore what customers truly care about and which factors contribute to positive feedback. This collaborative approach improves the interpretability and readability of the large volume of comments, no longer limiting the analysis to explicit sentiment classification, and generates concise, readable summary sentences. Using customer reviews as a case study, the results reveal that 10 themes are prominently featured in positive reviews. These findings suggest that both accommodation comfort and positive host–guest interactions increase the likelihood of positive reviews.
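As a minimal sketch of the LDA step described above (assumed details; the study's LLM summarization, sentiment scoring, and regression components are not reproduced), topic proportions and top words per topic could be obtained as follows:

```python
# Minimal, hypothetical LDA sketch with scikit-learn; the placeholder reviews and the
# choice of 10 components (mirroring the 10 reported themes) are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "The host was friendly and the bed was very comfortable.",
    "Great location but the room was noisy at night.",
    "Spotless apartment, easy check-in, and quick responses from the host.",
    "The kitchen was well equipped and the neighborhood felt safe.",
]  # placeholder reviews; a real corpus is needed for meaningful topics

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)                      # document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_weights = lda.fit_transform(dtm)                       # per-review topic proportions

terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-5:][::-1]]      # top words per topic
    print(f"Topic {k}: {', '.join(top)}")
```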
Keywords
Sentiment Analysis
Large Language Model
Artificial Intelligence
Latent Dirichlet Allocation
Multinomial Logistic Regression
Topic Modeling
Causal inference for treatment effect estimation with a single treated unit is challenging due to the absence of valid control units. The synthetic control method (SCM) offers an innovative way of constructing a so-called data-driven control unit. The generalized synthetic control (GSC) method was proposed as a factor-model-based extension of SCM. While GSC improves upon SCM, its performance depends heavily on the choice of the number of latent factors. To account for the uncertainty associated with the number of factors, we propose to employ a Bayesian infinite factor modeling approach. The key idea is to assign a cumulative shrinkage process prior to the factor loadings. In addition, we apply a Gaussian process approach to infer the non-linear treatment effect. The proposed Bayesian framework enables full Bayesian inference about the time-varying treatment effect. The merits of the proposed method are demonstrated through simulation studies and real data analysis.
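For context, the interactive fixed effects structure underlying GSC-type models can be sketched (illustrative notation, not the authors' exact specification) as

\[
Y_{it} = D_{it}\,\tau_{it} + \boldsymbol{\lambda}_i^{\top}\mathbf{f}_t + \varepsilon_{it},
\]

where \(D_{it}\) is the treatment indicator, \(\tau_{it}\) the time-varying treatment effect, \(\mathbf{f}_t\) the latent factors, and \(\boldsymbol{\lambda}_i\) the loadings. In the approach described above, the loadings receive a cumulative shrinkage process prior so that the effective number of factors is inferred rather than fixed, and \(\tau_{it}\) is modeled non-linearly via a Gaussian process.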
Keywords
Bayesian infinite factor model
Causal inference
Interactive fixed effects model
Synthetic control method
High-dimensional biomedical data pose challenges such as multicollinearity, small sample sizes, and instability in variable selection. Feature selection is crucial for interpretability, reproducibility, and robust statistical learning. Traditional Bayesian methods, such as SSVS and spike-and-slab priors, often depend on precise distributional assumptions and are sensitive to prior choices, limiting their stability and transparency in prioritizing variables. We propose BRISK, a rank-based Bayesian feature selection framework utilizing robust permutation kernels, which unifies feature-specific evidence into stable rankings based on association strength and consistency across modeling scenarios. Unlike binary selection, BRISK generates a prioritized list, aiding validation and expert review. Empirical results on simulated and real omics data demonstrate BRISK's superior stability, reduced false discoveries, and improved predictive accuracy compared to traditional methods, especially when predictors are correlated or samples are limited. BRISK offers a reliable, interpretable approach for high-dimensional biomedical feature selection.
Keywords
Feature selection
High-dimensional data
Rank-based framework
Bayesian methods
Stability
Omics data application
Factor analysis methods are widely used for exploring latent characteristics of a random process. Spatio-temporal factor analysis extends this approach to account for spatial and temporal dependencies in spatio-temporal data. By modeling these dependencies, the estimated processes can be interpolated to unobserved locations. However, Bayesian spatio-temporal models are notoriously computationally burdensome, limiting their practical use. To address this, we developed the BSTFA package in R to automatically fit an efficient Bayesian spatio-temporal factor analysis model using dimension-reduced basis functions. The BSTFA package is user-friendly, computationally fast, and provides a powerful tool for modeling and interpreting spatio-temporal dependencies. We demonstrate its utility with a case study modeling PM2.5 levels across the state of California over 25 years.
Keywords
Bayesian modeling
R Package
Spatio-temporal data
MCMC
Latent analysis
Basis functions
Archetypal analysis (AA) is an interpretable, geometry-based unsupervised learning method that has been extended to functional data, where observations are represented as curves. In many industrial applications, such as manufacturing process monitoring, sensor measurements are inherently functional, making anomaly detection a critical task for monitoring and control. AA-based representations summarize functional observations using archetype coefficients, providing compact and interpretable features for functional anomaly detection. However, existing AA-based approaches often rely on heuristic and user-dependent criteria to identify anomalies, which can limit reproducibility and robustness. Motivated by characteristic distributional gaps observed in archetype coefficient spaces, we propose a density-based anomaly detection framework that identifies anomalies as observations located in low-density regions. Anomaly scores are defined using kernel density estimation, and decision thresholds are determined automatically by contamination-based quantiles. The proposed method is evaluated using simulation studies and an application to real semiconductor manufacturing process sensor data.
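As a small sketch of the scoring rule described above (assumed details; the archetypal analysis step itself is omitted and the archetype coefficients are simulated), anomaly scores and a contamination-based threshold could be computed as:

```python
# Density-based anomaly scoring on (simulated) archetype coefficients: kernel density
# estimates give anomaly scores, and the decision threshold is a contamination-based
# quantile. Coefficients, bandwidth, and contamination level are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
coeffs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=200)    # placeholder archetype coefficients

kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(coeffs)
anomaly_score = -kde.score_samples(coeffs)                 # low density -> high score

contamination = 0.05                                       # assumed anomaly fraction
threshold = np.quantile(anomaly_score, 1 - contamination)  # automatic decision threshold
is_anomaly = anomaly_score > threshold
print(f"Flagged {is_anomaly.sum()} of {len(coeffs)} observations")
```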
Keywords
Archetypal analysis
Functional data
Manufacturing process
Anomaly detection
Density estimation
Decision threshold
Entrepreneurial threats are extreme realizations of investment risk, materializing as ruin events in the surplus dynamics of entrepreneurial or investment activities. Across social, economic, or institutional Entrepreneurships Interests Focus (E.I.F.), Acting Levels (E.A.L.) of Micro, Meso, or Macro, and Dynamic Trends (E.D.T.) of innovation, impact, or problem-solving, threats represent adverse risks that may cause irreversible failure. Entrepreneurial risk perceived negatively is represented by the controlled surplus process R_t^π = u + ct − ∑_{i=1}^{N_t} X_i − L_t^π, with ruin defined as R_t^π < 0 for some t ≥ 0, initial capital u ≥ 0, and premium income c > 0. Within a generalized Erlang(n) inter-arrival structure, the maximum severity of ruin provides an extreme indicator of loss, enabling explicit downside risk assessment. Crossing E.I.F., E.A.L., and E.D.T. via paired extended decision analysis into a dynamic n-dimensional decision space, non-parametric density estimation at each risk state allows real-time computation of losses through the m-th moments of the failure dividend D_u, yielding lower bounds for optimal strategies within the Erlang(n) risk model.
Keywords
Entrepreneur Threats
Maximum severity of ruin
Controlled Surplus Process
Entrepreneurships Interests Focus (E.I.F.)
Entrepreneurships Acting Level (E.A.L.)
Entrepreneurships Dynamic Trend (E.D.T.)
A common situation in statistical computer experiments is when we have multiple models for the same phenomenon, where accuracy and computational cost vary across the different models. Kennedy and O'Hagan (2000) popularized a framework for this 'multi-fidelity problem' that relates the different models in a linear way using Gaussian processes, which was then extended to the nonlinear setting by Perdikaris et al. (2017). In this work, we further extend this framework using normalizing flows. Normalizing flows are a method of statistical inference in which a simple 'base' distribution is transformed into a more complex distribution through a series of invertible and differentiable transformations (see, e.g., Papamakarios et al. 2020). We present numerical studies showing that normalizing flows perform well on this problem and offer considerable flexibility.
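For background, the density transformation that normalizing flows rely on is the standard change-of-variables identity: for an invertible, differentiable map x = f(z) applied to a base variable z ~ p_Z,

\[
p_X(x) = p_Z\!\big(f^{-1}(x)\big)\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|,
\]

and composing simple maps f = f_K ∘ ⋯ ∘ f_1 yields a flexible density whose log-likelihood remains tractable as a sum of log-Jacobian terms (Papamakarios et al. 2020). How this machinery is wired into the nonlinear multi-fidelity recursion is the subject of the talk.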
Keywords
Multifidelity Modeling
Computer Experiments
Normalizing Flows
Sensor-driven anomaly detection is critical for yield and quality stability in semiconductor manufacturing. In real-world deployment, interpretability is increasingly necessary to determine which process variables drive anomalies and the timing of their occurrence. Multivariate process sensor time series are often noisy, incomplete, irregularly sampled, and high-dimensional, which complicates stable modeling and root-cause attribution. We propose an interpretable framework that smooths sensor traces with splines to form functional predictors and evaluates them on a common, feature-wise normalized time grid. Anomaly probabilities are then estimated using a variable-selecting functional logistic regression, which simultaneously identifies contributing sensors. To localize effects in time, we partition predictors and coefficient functions into predefined intervals to compute interval-wise contributions, yielding sensor-interval attributions. We demonstrate our method using simulations with known anomaly sources and apply it to the D2 wafer-trace dataset (multivariate process sensor time series from semiconductor manufacturing) to identify contributing sensors and time intervals.
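A simplified sketch of the preprocessing and classification steps (assumed details; simulated traces, a plain L1-penalized logistic regression standing in for the variable-selecting functional logistic regression, and the interval-wise attribution step reduced to a comment):

```python
# Spline-smoothed sensor traces evaluated on a common normalized grid, followed by a
# sparse logistic regression on the resulting functional features. Data are simulated.
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_wafers, n_sensors, n_obs = 60, 3, 80
grid = np.linspace(0.0, 1.0, 50)                     # common normalized time grid

X = np.zeros((n_wafers, n_sensors * len(grid)))
y = rng.integers(0, 2, size=n_wafers)                # placeholder anomaly labels
for i in range(n_wafers):
    feats = []
    for j in range(n_sensors):
        t = np.sort(rng.uniform(0.0, 1.0, n_obs))    # irregular sampling times
        trace = np.sin(4 * t + j) + 0.3 * rng.standard_normal(n_obs)
        spline = UnivariateSpline(t, trace, s=1.0)   # spline smoothing of one trace
        feats.append(spline(grid))                   # evaluate on the common grid
    X[i] = np.concatenate(feats)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
# Interval-wise contributions could then be summarized by aggregating |coefficient x
# feature| within predefined segments of the grid, per sensor.
```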
Keywords
Anomaly detection
Semiconductor manufacturing
Interpretability
Multivariate sensor time series
Functional logistic regression
Interval-wise contribution
In sensitive domains like healthcare, synthetic data can replace releasing microdata, aiming to match key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates. We propose a k–anonymity–aware sequential sampling approach that generates each synthetic record by sampling variables sequentially from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables already sampled) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
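A toy sketch of the sequential sampler on categorical data (assumed details: the histogram-based conditionals reduce to value counts here, and the dependence-guided relaxation is simplified to dropping the earliest conditioning variable rather than ranking by normalized mutual information):

```python
# k-anonymity-aware sequential sampling on a toy categorical table: each variable is
# drawn from the empirical conditional distribution given the values sampled so far,
# keeping only candidate values whose resulting partial pattern matches >= k original
# records; sparse contexts are relaxed by dropping conditioning variables.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({                                  # toy "original" dataset
    "region": rng.choice(["N", "S", "E", "W"], 500),
    "age_band": rng.choice(["18-34", "35-64", "65+"], 500),
    "diagnosis": rng.choice(["A", "B", "C"], 500),
})
k = 5

def sample_record(df, variables, k):
    record = {}
    for var in variables:
        ctx = list(record)                             # variables already sampled
        while True:
            sub = df
            for c in ctx:                              # condition on current context
                sub = sub[sub[c] == record[c]]
            counts = sub[var].value_counts()
            allowed = counts[counts >= k]              # values keeping the pattern k-frequent
            if len(allowed) > 0:
                break
            ctx = ctx[1:]                              # relax: drop a conditioning variable
        probs = allowed / allowed.sum()
        record[var] = rng.choice(probs.index.to_numpy(), p=probs.to_numpy())
    return record

synthetic = pd.DataFrame([sample_record(data, list(data.columns), k) for _ in range(100)])
print(synthetic.head())
```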
Keywords
Synthetic tabular data
Data Privacy
k-anonymity
Sequential sampling
Normalized mutual information
We present a new analytical method to derive a likelihood function that is marginalised over a population. This method can be used for computational advantage in the context of Bayesian hierarchical models and marginal likelihood calculations in Bayesian models. The key innovation is the specification of the necessary integrals in terms of high-order (sometimes fractional) derivatives of the population prior moment-generating function, if particular existence and differentiability conditions hold.
We confine our attention to Poisson and gamma likelihood functions. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.
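To illustrate the Poisson case in our own notation (a standard identity consistent with the description above): if y | λ ~ Poisson(λ) and λ has population prior π with moment-generating function M_π(t) = E[e^{tλ}], then

\[
p(y) \;=\; \int \frac{\lambda^{y} e^{-\lambda}}{y!}\,\pi(\lambda)\,d\lambda
\;=\; \frac{1}{y!}\left.\frac{d^{y}}{dt^{y}} M_{\pi}(t)\right|_{t=-1},
\]

so the observed count y determines the order of the derivative, provided M_π exists and is sufficiently differentiable at t = −1.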
We also present examples validating this new analytical method. In some of the examples, the new method is the only known analytical method to calculate the integral, giving instantaneous and accurate calculations.
Keywords
model evidences
fractional derivatives
moment-generating function
integration
Bayesian modeling
The Dirichlet distribution is a multivariate generalization of the Beta distribution, defining a family of unit-sum-constrained probabilities or proportions on a multi-dimensional simplex. This distribution is usually the first choice in modeling compositional data and has been applied in various fields, including modeling microbiome data, text classification, and market share analysis. The existing literature suggests that the maximum likelihood estimator (MLE) is the most effective method for estimating Dirichlet parameters. However, a significant issue is that simply assuming the existence and uniqueness of the MLE for the Dirichlet model parameters without an analytic proof can lead to a meaningless interpretation of its bias and/or relative mean squared error. First, we address this problem by proving the existence and uniqueness of MLEs for the general Dirichlet distribution parameters. Our method relies on a particular representation of the digamma function, and our proof is much simpler than the one by Ronning (1989). In the course of our investigation, we have also proved a conjecture left open by Ronning (1989) for the computation of the MLE, thereby bringing
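For reference, with i.i.d. compositions x_1, …, x_N and parameter vector α = (α_1, …, α_K), the Dirichlet score equations whose solution is the MLE are the standard ones (textbook material, in our notation; the talk's contribution concerns proving that a unique solution exists):

\[
\psi(\hat{\alpha}_j) - \psi\!\left(\sum_{k=1}^{K}\hat{\alpha}_k\right) \;=\; \frac{1}{N}\sum_{i=1}^{N}\log x_{ij},
\qquad j = 1,\dots,K,
\]

where ψ denotes the digamma function.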
Keywords
Apery’s constant
Beta function
Digamma function
Euler’s constant
Log-likelihood function
Trigamma function
We introduce a multivariate asymmetric spatial covariance model that replaces Matérn marginals with confluent hypergeometric (CH) covariance functions, providing greater flexibility in smoothness and tail behavior. Environmental processes often exhibit spatial delays, especially under the influence of prevailing wind or water flows, resulting in asymmetric cross-covariances between variables. Our construction operates within the conditional framework of Cressie and Zammit-Mangion (2016), using interaction functions to encode asymmetric cross-variable dependence. We give sufficient conditions on a class of interaction functions under which the resulting CH-based multivariate covariance is positive definite. We evaluate performance through simulation with cokriging, comparing predictive accuracy against multivariate symmetric and asymmetric Matérn-based models and a univariate CH model. We also illustrate the approach with a temperature and pressure dataset, showing improved fit and spatial prediction relative to Matérn-based alternatives.
Keywords
Spatial statistics
Multivariate spatial statistics
Asymmetric cross-covariance
Confluent hypergeometric covariance
Long-range dependence
Conditional approach
We develop a Bayesian overfitted multivariate beta mixture model for clustering aggregated ecological data bounded between 0 and 1. Such data, common in social determinants of health (SDoH) research, pose challenges for standard clustering methods due to restrictive distributional assumptions and limited interpretability. The proposed model reparameterizes the multivariate beta distribution in terms of mean and concentration parameters, enabling direct interpretation of cluster-specific profiles while accommodating skewness inherent in the data. Integrated feature saliency operates on cluster means to induce sparsity by identifying variables that meaningfully drive clustering and shrinking uninformative features toward a shared mean. An overfitted mixture formulation supports data-driven inference on the number of clusters while preserving posterior uncertainty. We assess performance through simulation studies and apply the model to neighborhood-level SDoH data from the Agency for Healthcare Research and Quality, yielding interpretable ecological clusters. The framework generalizes to a broad class of bounded, aggregated multivariate data.
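The mean-concentration reparameterization referenced above is, for a single beta margin, the standard one (illustrative notation): writing a Beta(a, b) component with mean μ = a/(a + b) and concentration φ = a + b gives

\[
x \sim \mathrm{Beta}\big(\mu\phi,\ (1-\mu)\phi\big), \qquad
\mathbb{E}(x) = \mu, \qquad \mathrm{Var}(x) = \frac{\mu(1-\mu)}{\phi + 1},
\]

so cluster-specific mean profiles are directly interpretable while the concentration controls within-cluster dispersion.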
Keywords
Bayesian mixture model
multivariate beta distribution
sparse modeling
ecological data
feature saliency
This study develops a statistical framework to forecast global sea-level change as a function of atmospheric carbon dioxide (CO₂) concentrations and global temperature and to conduct stress testing under alternative climate policy scenarios. Three scenarios are considered: a) an expected scenario reflecting current emission trends, b) a best-case scenario assuming compliance with Kyoto Protocol CO₂ reduction targets, and c) a worst-case scenario assuming CO₂ emissions increase at a rate opposite to the Kyoto targets.
The analysis employs a three-stage modeling approach based on Seasonal Autoregressive Integrated Moving Average models with exogenous variables (SARIMAX). In the first stage, CO₂ dynamics are modeled using a univariate SARIMAX specification. In the second stage, global temperature is modeled with lagged temperature and CO₂ as an exogenous predictor. In the final stage, sea level is modeled as a function of its own dynamics and lagged global temperature. The estimated models are used to generate sea-level projections under the three scenarios.
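A schematic sketch of the three-stage SARIMAX chain (assumed column names, frequency, and orders; the synthetic monthly series below is for illustration only):

```python
# Stage 1: univariate SARIMAX for CO2; Stage 2: temperature with lagged CO2 as an
# exogenous input; Stage 3: sea level with lagged temperature as an exogenous input.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_three_stage(df, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12)):
    m_co2 = SARIMAX(df["co2"], order=order, seasonal_order=seasonal_order).fit(disp=False)
    exog_temp = df["co2"].shift(1).bfill().to_frame("co2_lag1")
    m_temp = SARIMAX(df["temp"], exog=exog_temp, order=order,
                     seasonal_order=seasonal_order).fit(disp=False)
    exog_sea = df["temp"].shift(1).bfill().to_frame("temp_lag1")
    m_sea = SARIMAX(df["sea_level"], exog=exog_sea, order=order,
                    seasonal_order=seasonal_order).fit(disp=False)
    return m_co2, m_temp, m_sea

# Synthetic monthly data, purely illustrative
rng = np.random.default_rng(0)
idx = pd.date_range("1990-01-01", periods=240, freq="MS")
trend = np.arange(240)
df = pd.DataFrame({
    "co2": 350 + 0.15 * trend + rng.normal(0, 1.0, 240),
    "temp": 14 + 0.002 * trend + rng.normal(0, 0.2, 240),
    "sea_level": 100 + 0.3 * trend + rng.normal(0, 2.0, 240),
}, index=idx)

m_co2, m_temp, m_sea = fit_three_stage(df)
# Scenario projections then chain the forecasts: alternative CO2 paths (expected,
# best-case, worst-case) feed the temperature model, whose forecasts feed the
# sea-level model through the exogenous inputs.
```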
The results indicate significant sea-level rise under the expected scenario, stabilization under the best-case scenario
Keywords
stress testing
SARIMAX
global warming
predictive modeling
temperature
CO2
sea level
We propose a tensor Bayesian copula factor autoregressive model with multivariate responses for analyzing mixed-type time series data with both main effects and interactions. The model is motivated by the need to study dynamic relationships between macroeconomic variables and stock market indices, leading naturally to tensor-valued posterior distributions. Dependence is captured through latent factors in both the multivariate response time series and high-dimensional mixed-type covariates within a quadratic time series regression framework coupled with copula functions. To enhance computational efficiency, we employ a semiparametric extended rank likelihood for the marginal distributions of the covariates, substantially reducing parameter dimensionality. Posterior inference is performed using Metropolis–Hastings and Forward Filtering Backward Sampling algorithms embedded in a Gibbs sampling scheme. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application to a real macroeconomic dataset.
Keywords
Multivariate tensor time series
Bayesian inference
Factor analysis
Copula models
Depression is increasing among adolescents and young adults. Identifying clinically meaningful subtypes of depression and comorbidity patterns in youth is therefore a critical priority. Topic modeling of electronic health records (EHR) offers a promising strategy to uncover latent depression phenotypes and patient subtypes underlying diagnostic co-occurrence. Yet two analytical barriers remain: existing methods often lack the efficiency needed for large EHR datasets with high-dimensional features, and the smaller, sparser records common in youth populations hinder estimation stability. To overcome these challenges, we propose a transfer topic modeling approach that integrates the computationally efficient Topic-SCORE algorithm with a ridge-type estimator to stabilize latent subspace estimation and improve topic matrix recovery. Simulation studies show our method outperforms models trained only on the target population and alternative transfer learning approaches. Leveraging the All of Us Research Program, our method identifies seven clinically meaningful latent structures of youth depression and distinguishes subgroups at elevated risk for suicidal thoughts and behaviors.
Keywords
Electronic health records
Representation learning
Transfer learning
Mental health
Suicide risk