Sunday, Aug 2: 2:00 PM - 3:50 PM
7103
Contributed Speed
Thomas M. Menino Convention & Exhibition Center
Room: CC-102A
Presentations
This entry to the 2026 Data Expo Challenge will use the What Works Clearinghouse database to explore what works in education. Both static and interactive graphics will be used to communicate the results.
Despite widespread investment in attendance-targeted interventions, chronic absenteeism has doubled in U.S. public schools since 2018 and remains elevated. We argue the field has been treating a symptom rather than the underlying condition. Using the What Works Clearinghouse as an analytical dataset, we examine whether interventions designed for attendance outperform engagement-oriented programs that reduce absenteeism as a secondary effect. We then test whether this pattern holds under real-world conditions by studying how economic shocks — events that suddenly disrupt family stability and student engagement — differentially affect districts depending on which type of program they had in place. Our findings point toward a reorientation of both research and policy away from attendance monitoring and toward engagement as the more promising lever for keeping students in school.
Keywords
The Annual Data Challenge Expo
Evidence-based education policy increasingly relies on systematic reviews to inform the adoption and scaling of instructional programs. We examine heterogeneity in educational intervention effectiveness using the What Works Clearinghouse dataset maintained by the U.S. Department of Education. To address the question of "what works, for whom, and under what conditions," we analyze intervention–protocol combinations across outcome domains and student subgroups. In contrast to prior work that emphasizes average intervention effects, our analysis focuses on subgroup-specific effectiveness across populations, geographic regions, educational contexts, and implementation conditions. We leverage hierarchical models to account for findings nested within studies and interventions, and integrate external data from the National Center for Education Statistics and the U.S. Census Bureau to incorporate broader educational and demographic context. By comparing overall and subgroup-specific results, we identify interventions with broad effectiveness as well as those whose impacts are concentrated in particular populations or settings.
Keywords
Subgroup Analysis
Hierarchical Modeling
Causal Inference
Online consumers provide rich, unstructured textual data expressing their satisfaction and dissatisfaction. This has motivated researchers to investigate how to systematically process, analyze, and extract meaningful patterns hidden in large volumes of unstructured text. This study combines a fine-tuned local large language model (LLM) with sentiment analysis and LDA-based topic modeling of online reviews to identify what customers truly care about and which factors contribute to positive feedback. This novel collaborative approach improves the interpretability and readability of the large volume of comments, no longer limiting them to explicit sentiment classification before sentiment analysis, and generates concise, readable summary sentences. Using customer reviews as a case study, the results reveal 10 themes that feature prominently in positive reviews. These findings suggest that both accommodation comfort and positive host–guest interactions increase the likelihood of positive reviews.
Keywords
Sentiment Analysis
Large Language Model, Fine Tuned AI, Local LLM
Artificial Intelligence
Latent Dirichlet
Multinomial Logistic Regression
Topic Modeling
Causal inference for single treatment effect estimation is challenging due to the absence of valid control units. The synthetic control method (SCM) offers an innovative way of constructing the so-called data-driven control unit. The generalized synthetic control (GSC) method is proposed as a factor model-based extension of SCM. While GSC improves upon SCM, the performance of GSC heavily depends on the choice of the number of latent factors. To account for the uncertainty associated with the number of factors, we propose to employ a Bayesian infinity factor modeling approach. The key idea of our Bayesian infinity factor modeling is to assign a cumulative shrinkage process prior on the factor loadings. In addition, we apply a Gaussian process approach to infer the non-linear treatment effect. The proposed Bayesian framework enables us to make full Bayesian inference about the time-varying treatment effect. The merits of the proposed Bayesian method are demonstrated through simulation studies and real data analysis.
Keywords
Bayesian infinity factor model
Causal inference
Interactive fixed effects model
Synthetic control method
High-dimensional biomedical data pose challenges such as multicollinearity, small sample sizes, and instability in variable selection. Feature selection is crucial for interpretability, reproducibility, and robust statistical learning. Traditional Bayesian methods, such as SSVS and spike-and-slab priors, often depend on precise distributional assumptions and are sensitive to prior choices, limiting their stability and transparency in prioritizing variables. We propose BRISK, a rank-based Bayesian feature selection framework utilizing robust permutation kernels, which unifies feature-specific evidence into stable rankings based on association strength and consistency across modeling scenarios. Unlike binary selection, BRISK generates a prioritized list, aiding validation and expert review. Empirical results on simulated and real omics data demonstrate BRISK's superior stability, reduced false discoveries, and improved predictive accuracy compared to traditional methods, especially when predictors are correlated or samples are limited. BRISK offers a reliable, interpretable approach for high-dimensional biomedical feature selection.
Keywords
Feature selection
High-dimensional data
Rank-based framework
Bayesian methods
Stability
Omics data application
Factor analysis methods are widely used for exploring latent characteristics of a random process. Spatio-temporal factor analysis extends this approach to account for spatial and temporal dependencies in spatio-temporal data. By modeling these dependencies, the estimated processes can be interpolated to unobserved locations. However, Bayesian spatio-temporal models are notoriously computationally burdensome, limiting their practical use. To address this, we developed the BSTFA package in R to automatically fit an efficient Bayesian spatio-temporal factor analysis model using dimension-reduced basis functions. The BSTFA package is user-friendly, computationally fast, and provides a powerful tool for modeling and interpreting spatio-temporal dependencies. We demonstrate its utility with a case study modeling PM 2.5 levels across the state of California for 25 years.
Keywords
Bayesian modeling
R Package
Spatio-temporal data
MCMC
Latent analysis
Basis functions
Archetypal analysis (AA) is an interpretable, geometry-based unsupervised learning method that has been extended to functional data, where observations are represented as curves. In many industrial applications, such as manufacturing process monitoring, sensor measurements are inherently functional, making anomaly detection a critical task for monitoring and control. AA-based representations summarize functional observations using archetype coefficients, providing compact and interpretable features for functional anomaly detection. However, existing AA-based approaches often rely on heuristic and user-dependent criteria to identify anomalies, which can limit reproducibility and robustness. Motivated by characteristic distributional gaps observed in archetype coefficient spaces, we propose a density-based anomaly detection framework that identifies anomalies as observations located in low-density regions. Anomaly scores are defined using kernel density estimation, and decision thresholds are determined automatically by contamination-based quantiles. The proposed method is evaluated using simulation studies and an application to real semiconductor manufacturing process sensor data.
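The density-based scoring and contamination-quantile threshold described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the fixed bandwidth, the toy archetype coefficients, and the function names `kde_density` and `flag_low_density` are all assumptions made for illustration.

```python
import math

def kde_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at each input point."""
    n, dim = len(points), len(points[0])
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        densities.append(total / norm)
    return densities

def flag_low_density(points, bandwidth, contamination):
    """Flag the lowest-density fraction `contamination` of the points."""
    densities = kde_density(points, bandwidth)
    # Assumed convention: the threshold is the density of the
    # floor(contamination * n)-th least dense point (at least one point).
    n_flag = max(1, int(contamination * len(points)))
    threshold = sorted(densities)[n_flag - 1]
    return [d <= threshold for d in densities]

# Hypothetical archetype coefficients (each row sums to 1): three tight
# groups near the archetypes plus one curve sitting in the gap between them.
coeffs = [
    (0.90, 0.05, 0.05), (0.88, 0.07, 0.05), (0.92, 0.04, 0.04),
    (0.05, 0.90, 0.05), (0.06, 0.89, 0.05), (0.04, 0.92, 0.04),
    (0.05, 0.05, 0.90), (0.07, 0.05, 0.88), (0.04, 0.04, 0.92),
    (0.34, 0.33, 0.33),
]
flags = flag_low_density(coeffs, bandwidth=0.1, contamination=0.1)
```

In this toy setting only the last observation is flagged, because it is the only one lying in a low-density region of the coefficient space. In practice the bandwidth and contamination level would need to be chosen for the data at hand.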
Keywords
Archetypal analysis
Functional data
Manufacturing process
Anomaly detection
Density estimation
Decision threshold
Entrepreneurial threats are extreme realizations of investment risk, materializing as ruin events in the surplus dynamics of entrepreneurial or investment activities. Across social, economic, or institutional Entrepreneurships Interests Focus (E.I.F.), with Acting Levels (E.A.L.) Micro, Meso, or Macro, and Dynamic Trends (E.D.T.) innovation, impact, or problem-solving, threats represent adverse risks that may cause irreversible failure. Entrepreneurial risk perceived negatively is represented by the surplus process $R^{\pi}_t = u + ct - \sum_i X_i - L^{\pi}_t$, with ruin defined as $R^{\pi}_t < 0$ for t ≥ 0, initial capital u ≥ 0, and premium income c > 0. Within a generalized Erlang(n) inter-arrival structure, the maximum severity of ruin provides an extreme indicator of loss, enabling explicit downside risk assessment. Crossing E.I.F., E.A.L., and E.D.T. via paired extended decision analysis into a dynamic n-dimensional decision space, nonparametric density estimation at each risk state allows real-time computation of losses through the m-th moments of the failure dividend $D_u$, yielding lower bounds for optimal strategies within the Erlang(n) risk model.
Keywords
Entrepreneur Threats
Maximum severity of ruin
Controlled Surplus Process
Entrepreneurships Interests Focus (E.I.F.)
Entrepreneurships Acting Level (E.A.L.)
Entrepreneurships Dynamic Trend (E.D.T.)
A common situation in statistical computer experiments is when we have multiple models for the same phenomenon, where accuracy and computational cost vary across the different models. Kennedy and O'Hagan (2000) popularized a framework for this 'multi-fidelity problem' that relates the different models in a linear way using Gaussian processes, which was then extended to the nonlinear setting in Perdikaris et al. (2017). In this work, we further extend this framework by using normalizing flows. Normalizing flows are a method of statistical inference where we transform a simple 'base' distribution into a more complex distribution with a series of invertible and differentiable transformations (see e.g. Papamakarios et al. 2020). We show results from numerical studies showing that using normalizing flows for this problem performs well and is flexible.
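The change-of-variables idea behind normalizing flows can be sketched in a few lines. This is only an illustration of the general mechanism, not the multi-fidelity model in the abstract: the two layers (`Affine`, `Exp`), their parameters, and the function `flow_logpdf` are hypothetical choices, picked because the resulting density is a lognormal that can be checked in closed form.

```python
import math

def std_normal_logpdf(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

class Affine:
    """Forward map z -> scale * z + shift (invertible when scale != 0)."""
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift
    def inverse(self, x):
        return (x - self.shift) / self.scale
    def log_abs_det_inverse(self, x):
        return -math.log(abs(self.scale))

class Exp:
    """Forward map z -> exp(z); the inverse is log, defined for x > 0."""
    def inverse(self, x):
        return math.log(x)
    def log_abs_det_inverse(self, x):
        return -math.log(x)  # |d/dx log(x)| = 1 / x

def flow_logpdf(x, layers):
    """Log density of x under the flow; `layers` are in forward order."""
    log_det = 0.0
    for layer in reversed(layers):
        # Accumulate the log-Jacobian of each inverse step, then undo it.
        log_det += layer.log_abs_det_inverse(x)
        x = layer.inverse(x)
    return std_normal_logpdf(x) + log_det

# Base N(0, 1) -> affine (sd 0.5, mean 1.0) -> exp gives a lognormal.
layers = [Affine(scale=0.5, shift=1.0), Exp()]
value = flow_logpdf(2.0, layers)
```

Flows used in practice replace these fixed maps with learned, more expressive invertible transformations, but the density calculation follows the same pattern.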
Keywords
Multifidelity Modeling
Computer Experiments
Normalizing Flows
Sensor-driven anomaly detection is critical for yield and quality stability in semiconductor manufacturing. In real-world deployment, interpretability is increasingly necessary to determine which process variables drive anomalies and when those anomalies occur. Multivariate process sensor time series are often noisy, incomplete, irregularly sampled, and high-dimensional, which complicates stable modeling and root-cause attribution. We propose an interpretable framework that smooths sensor traces with splines to form functional predictors and evaluates them on a common, feature-wise normalized time grid. Anomaly probabilities are then estimated using a variable-selecting functional logistic regression, which simultaneously identifies contributing sensors. To localize effects in time, we partition predictors and coefficient functions into predefined intervals to compute interval-wise contributions, yielding sensor-interval attributions. We demonstrate our method using simulations with known anomaly sources and apply it to the D2 wafer-trace dataset (multivariate process sensor time series from semiconductor manufacturing) to identify contributing sensors and time intervals.
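One plausible reading of the interval-wise contributions is that the functional linear predictor, the integral of x(t)β(t) over the time grid, is split into one integral per predefined interval. The sketch below illustrates that reading for a single sensor. The trapezoidal rule, the toy trace, the coefficient function β(t) = t, and the helper names are assumptions, not details taken from the abstract.

```python
def trapezoid(values, grid):
    """Trapezoidal-rule integral of sampled values over the grid."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )

def interval_contributions(x, beta, grid, edges):
    """Integrate x(t) * beta(t) separately over each interval.

    `edges` holds grid indices marking interval boundaries,
    for example [0, 50, 100] for two intervals on a 101-point grid.
    """
    product = [xi * bi for xi, bi in zip(x, beta)]
    return [
        trapezoid(product[lo:hi + 1], grid[lo:hi + 1])
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

grid = [i / 100 for i in range(101)]   # common normalized time grid on [0, 1]
x = [1.0 for _ in grid]                # hypothetical smoothed sensor trace
beta = [t for t in grid]               # hypothetical coefficient function
contrib = interval_contributions(x, beta, grid, edges=[0, 50, 100])
total = trapezoid([xi * bi for xi, bi in zip(x, beta)], grid)
```

Here the second interval contributes about 0.375 of the total 0.5, so it would be attributed most of this sensor's effect. Because the intervals share only boundary points, the contributions add up to the full integral.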
Keywords
Anomaly detection
Semiconductor manufacturing
Interpretability
Multivariate sensor time series
Functional logistic regression
Interval-wise contribution
In sensitive domains like healthcare, synthetic data can replace the release of microdata, aiming to match key statistics while reducing re-identification and attribute-inference risk. Yet generators may still emit rare patterns or near-duplicates. We propose a k-anonymity-aware sequential sampling approach that generates each synthetic record by sampling variables sequentially from histogram-based empirical conditional distributions. At each step, the method restricts candidate values so that the resulting partial pattern (the projection onto the variables already sampled) has at least k matching records in the original dataset. When the conditioning context is too sparse, we relax the conditioning set in a dependence-guided manner, dropping variables weakly related to the variable currently being sampled (e.g., ranked by normalized mutual information) while retaining the minimum-frequency requirement. Overall, k serves as a transparent, user-specified control over the minimum frequency of generated patterns, while dependence-guided relaxation can help preserve useful multivariate structure, supporting a practical balance between fidelity and privacy in synthetic data release.
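The core restricted-sampling step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: variables are sampled in a fixed column order, all variables are categorical, and the dependence-guided relaxation is omitted, so the hypothetical function `sample_record` simply raises an error if no candidate value meets the minimum frequency.

```python
import random
from collections import Counter

def sample_record(original, k, rng):
    """Sample one synthetic record, one variable at a time."""
    n_vars = len(original[0])
    prefix = ()
    for j in range(n_vars):
        # Empirical conditional counts of variable j among the original
        # records that match the partial pattern sampled so far.
        counts = Counter(row[j] for row in original if row[:j] == prefix)
        # Keep only values whose extended partial pattern has >= k matches.
        allowed = {v: c for v, c in counts.items() if c >= k}
        if not allowed:
            raise ValueError("no candidate meets the minimum frequency")
        values = list(allowed)
        weights = [allowed[v] for v in values]
        prefix += (rng.choices(values, weights=weights, k=1)[0],)
    return prefix

# Hypothetical original data: two common patterns and one unique record.
original = (
    [("F", "30s", "flu")] * 3
    + [("M", "40s", "cold")] * 3
    + [("F", "40s", "flu")]
)
rng = random.Random(0)
synthetic = [sample_record(original, k=2, rng=rng) for _ in range(50)]
```

With k = 2, the unique record ("F", "40s", "flu") can never be generated, because its partial pattern ("F", "40s") matches only one original record. Every generated record in this toy example matches at least k original records.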
Keywords
Synthetic tabular data
Data Privacy
k-anonymity
Sequential sampling
Normalized mutual information
We present a new analytical method to derive a likelihood function that is marginalised over a population. This method can be used for computational advantage in the context of Bayesian hierarchical models and marginal likelihood calculations in Bayesian models. The key innovation is the specification of the necessary integrals in terms of high-order (sometimes fractional) derivatives of the population prior moment-generating function, if particular existence and differentiability conditions hold.
We confine our attention to Poisson and gamma likelihood functions. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.
We also present examples validating this new analytical method. In some of the examples, the new method is the only known analytical method to calculate the integral, giving instantaneous and accurate calculations.
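For the Poisson case, one way to see why the observed count sets the order of the derivative is the following worked identity. It is a sketch assuming a rate λ ≥ 0 with population prior p(λ) and moment-generating function M(t) = E[exp(tλ)] that is y times differentiable at t = −1. The gamma prior with shape a and rate b is a hypothetical example chosen here because the result can be checked against the familiar negative binomial form.

```latex
\begin{align*}
m(y) &= \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!}\, p(\lambda)\, d\lambda
      = \frac{1}{y!}\, \mathbb{E}\!\left[\lambda^{y} e^{-\lambda}\right]
      = \frac{1}{y!}\, M^{(y)}(-1),
\qquad M(t) = \mathbb{E}\!\left[e^{t\lambda}\right]. \\[4pt]
\text{Gamma}(a, b)\text{ prior:}\quad
M(t) &= \left(1 - \tfrac{t}{b}\right)^{-a}, \qquad
M^{(y)}(t) = \frac{\Gamma(a+y)}{\Gamma(a)}\, b^{-y} \left(1 - \tfrac{t}{b}\right)^{-a-y}, \\[4pt]
m(y) &= \frac{\Gamma(a+y)}{\Gamma(a)\, y!}
        \left(\frac{b}{b+1}\right)^{a} \left(\frac{1}{b+1}\right)^{y}.
\end{align*}
```

The last line is the negative binomial probability mass function, which agrees with the standard Poisson-gamma marginal.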
Keywords
model evidences
fractional derivatives
moment-generating function
integration
Bayesian modeling
The Dirichlet distribution is a multivariate generalization of the Beta distribution, defining a family of unit sum-constrained probabilities or proportions in a multi-dimensional simplex. This distribution is usually the first choice in modeling compositional data and has been applied in various fields, including modeling microbiome data, text classification, and market share analysis. The existing literature suggests that the maximum likelihood estimator (MLE) is the most effective method for estimating Dirichlet parameters. However, a significant issue is that simply assuming the existence and uniqueness of the MLE for the Dirichlet model parameter without an analytic proof can lead to a meaningless interpretation of its bias and/or relative mean squared error. First, we address this problem by proving the existence and uniqueness of MLEs for the general Dirichlet distribution parameters. Our method relies on a particular representation of the digamma function, and our proof is much simpler than the one by Ronning (1989). In the course of our investigation, we have also proved a conjecture left open by Ronning (1989) for the computation of the MLE, thereby bringing
Keywords
Apery’s constant
Beta function
Digamma function
Euler’s constant
Log-likelihood function
Trigamma function
We develop a bivariate asymmetric spatial covariance model using the conditional approach of Cressie and Zammit-Mangion (2016). Our construction includes both a simple pointwise dependence interaction and an asymmetric dependence interaction, while ensuring that both marginal covariance functions are from the same family such as the Matérn class. Unlike the conditional formulation in Cressie and Zammit-Mangion (2016), in which the same covariance family is imposed on only one marginal and a conditional component, our approach preserves the same marginals for both observed variables, aiding interpretation, comparison with familiar univariate spatial models, and model specification in practice. We derive sufficient conditions under which the resulting bivariate covariance function is positive definite. Through simulation studies with cokriging, we compare the proposed models with existing symmetric and asymmetric Matérn-based alternatives and assess predictive performance. We further illustrate the superior performance of the proposed models with a spatial temperature-pressure dataset.
Keywords
Geostatistics
Asymmetry
Matérn covariance
Cross-covariance
Conditional approach
We develop a Bayesian overfitted multivariate beta mixture model for clustering aggregated ecological data bounded between 0 and 1. Such data, common in social determinants of health (SDoH) research, pose challenges for standard clustering methods due to restrictive distributional assumptions and limited interpretability. The proposed model reparameterizes the multivariate beta distribution in terms of mean and concentration parameters, enabling direct interpretation of cluster-specific profiles while accommodating skewness inherent in the data. Integrated feature saliency operates on cluster means to induce sparsity by identifying variables that meaningfully drive clustering and shrinking uninformative features toward a shared mean. An overfitted mixture formulation supports data-driven inference on the number of clusters while preserving posterior uncertainty. We assess performance through simulation studies and apply the model to neighborhood-level SDoH data from the Agency for Healthcare Research and Quality, yielding interpretable ecological clusters. The framework generalizes to a broad class of bounded, aggregated multivariate data.
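The mean and concentration reparameterization can be illustrated for a single beta-distributed coordinate. This sketch assumes the common convention a = μφ and b = (1 − μ)φ, which may differ from the authors' exact parameterization. The function names are hypothetical.

```python
def beta_shapes(mean, concentration):
    """Map (mean, concentration) to the usual beta shape parameters (a, b)."""
    if not 0.0 < mean < 1.0 or concentration <= 0.0:
        raise ValueError("need 0 < mean < 1 and concentration > 0")
    return mean * concentration, (1.0 - mean) * concentration

def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

# A hypothetical cluster-specific profile for one variable.
a, b = beta_shapes(mean=0.2, concentration=10.0)
mean, var = beta_moments(a, b)
```

Under this convention the mean parameter is recovered exactly and the variance equals μ(1 − μ)/(φ + 1), so a cluster's typical level and its spread can be read off separately.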
Keywords
Bayesian mixture model
multivariate beta distribution
sparse modeling
ecological data
feature saliency
This study develops a statistical framework to forecast global sea-level change as a function of atmospheric carbon dioxide (CO₂) concentrations and global temperature and to conduct stress testing under alternative climate policy scenarios. Three scenarios were considered: a) an expected scenario reflecting current emission trends, b) a best-case scenario assuming compliance with Kyoto Protocol CO₂ reduction targets, and c) a worst-case scenario assuming CO₂ emissions increase at a rate opposite to the Kyoto Protocol targets.
The analysis employs a three-stage modeling approach based on Seasonal Autoregressive Integrated Moving Average models with exogenous variables (SARIMAX). In the first stage, CO₂ dynamics are modeled using a univariate SARIMAX specification. In the second stage, global temperature is modeled with lagged temperature and CO₂ as an exogenous predictor. In the final stage, the sea level is modeled as a function of its own dynamics and lagged global temperature.
Results indicate continued sea-level rise under the expected scenario, partial stabilization under the best-case scenario, and accelerated increases under the worst-case scenario. The proposed methodology demonstrates how classical time-series methods can be used for climate stress testing and policy analysis.
Keywords
stress testing
SARIMAX
global warming
predictive modeling
temperature, CO2
sea level
We propose a tensor Bayesian copula factor autoregressive model with multivariate responses for analyzing mixed-type time series data with both main effects and interactions. The model is motivated by the need to study dynamic relationships between macroeconomic variables and stock market indices, leading naturally to tensor-valued posterior distributions. Dependence is captured through latent factors in both the multivariate response time series and high-dimensional mixed-type covariates within a quadratic time series regression framework coupled with copula functions. To enhance computational efficiency, we employ a semiparametric extended rank likelihood for the marginal distributions of the covariates, substantially reducing parameter dimensionality. Posterior inference is performed using Metropolis–Hastings and Forward Filtering Backward Sampling algorithms embedded in a Gibbs sampling scheme. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application to a real macroeconomic dataset.
Keywords
Multivariate tensor time series
Bayesian inference
Factor analysis
Copula models
Depression is increasing among adolescents and young adults. Identifying clinically meaningful subtypes of depression and comorbidity patterns in youth is therefore a critical priority. Topic modeling of electronic health records (EHR) offers a promising strategy to uncover latent depression phenotypes and patient subtypes underlying diagnostic co-occurrence. Yet two analytical barriers remain: existing methods often lack the efficiency needed for large EHR datasets with high-dimensional features, and the smaller, sparser records common in youth populations hinder estimation stability. To overcome these challenges, we propose a transfer topic modeling approach that integrates the computationally efficient Topic-SCORE algorithm with a ridge-type estimator to stabilize latent subspace estimation and improve topic matrix recovery. Simulation studies show our method outperforms models trained only on the target population and alternative transfer learning approaches. Leveraging the All of Us Research Program, our method identifies seven clinically meaningful latent structures of youth depression and distinguishes subgroups at elevated risk for suicidal thoughts and behaviors.
Keywords
Electronic health records
Representation learning
Transfer learning
Mental health
Suicide risk