Statistical Methods for Artificial Intelligence

Xiwei Tang, Chair
University of Virginia
 
Tracy Ke, Organizer
Harvard University
 
Monday, Aug 4: 10:30 AM - 12:20 PM
0549 
Invited Paper Session 
Music City Center 
Room: CC-Dean Grand Ballroom A1 

Applied: Yes

Main Sponsor

International Chinese Statistical Association

Co Sponsors

JASA Theory and Methods
Section on Text Analysis

Presentations

Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models

Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling. We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of K base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net-rounding applied before and kernel smoothing applied after. One advantage of this method is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modifications. Assuming each topic is a β-Hölder smooth intensity measure on the embedded space, we establish the rate of convergence of our method. We apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches. 
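The two wrapper steps named in the abstract, net-rounding before the plug-in topic model and kernel smoothing after it, can be sketched roughly as follows. This is a minimal illustration and not the speaker's implementation: the Euclidean nearest-point rule, the Gaussian kernel, the randomly drawn net, and the function names `net_round` and `kernel_smooth` are all assumptions made for the example. The plug-in topic model that would sit between the two steps is left out.

```python
import numpy as np

def net_round(doc_embeddings, net_points):
    """Map each embedding to its nearest net point and count hits per document.

    doc_embeddings: list of (n_i, d) arrays, one per document (assumed input format).
    net_points: (M, d) array of net points (assumed; e.g. cluster centers).
    Returns an (n_docs, M) integer count matrix, usable like a word-count matrix.
    """
    n_net = net_points.shape[0]
    counts = np.zeros((len(doc_embeddings), n_net), dtype=int)
    for i, emb in enumerate(doc_embeddings):
        # Squared Euclidean distance from every embedding to every net point.
        d2 = ((emb[:, None, :] - net_points[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        counts[i] = np.bincount(nearest, minlength=n_net)
    return counts

def kernel_smooth(z, net_points, topic_weights, bandwidth):
    """Gaussian-kernel smoothing of per-net-point topic weights at query points z.

    topic_weights: (M, K) array; column k holds topic k's estimated mass on the net
    (in practice this would come from the plug-in topic model).
    Returns a (len(z), K) array of smoothed, unnormalized intensities.
    """
    d2 = ((z[:, None, :] - net_points[None, :, :]) ** 2).sum(axis=2)
    kern = np.exp(-0.5 * d2 / bandwidth**2)
    return kern @ topic_weights

# Toy data: 5 documents of 2-dimensional "embeddings" and a net of 8 points.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(int(rng.integers(20, 40)), 2)) for _ in range(5)]
net = rng.normal(size=(8, 2))

X = net_round(docs, net)  # count matrix to hand to any traditional topic model
S = kernel_smooth(net, net, np.ones((8, 3)), bandwidth=0.5)  # placeholder weights
```

In this sketch, `X` plays the role of the word-count matrix, so an unmodified traditional topic modeling method could be applied to it directly.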

Speaker

Tracy Ke, Harvard University

Heterogeneity-Aware Synthetic Data Generation for Tabular Data

Recent advances in generative AI, such as diffusion models, have revolutionized data generation across various domains, including computational biology, medical research, the social sciences, and beyond. Yet, while image and text synthesis have seen remarkable progress, generating realistic tabular data, a common format in statistical applications, remains a significant challenge. Tabular datasets often involve mixed-type features (continuous, categorical, ordinal), complex inter-feature dependencies, and pronounced heterogeneity across individuals or subpopulations, making classical generative models ill-suited for statistical inference tasks. We introduce a novel diffusion-based framework designed specifically for synthetic tabular data generation, which incorporates feature-adaptive diffusion dynamics and subgroup-aware conditioning, explicitly addressing heterogeneity at both the variable and population levels. This enables our model to better capture local structure and dependence patterns essential for downstream statistical tasks. Through empirical studies, we demonstrate the model's strong performance across a variety of benchmarks and its value in applications such as missing data imputation, data augmentation, and downstream inference. The framework offers a promising pathway for bridging modern generative AI with classical statistical needs, given its ability to serve as an anonymized proxy for real datasets and power effective learning on downstream tasks.

Speaker

Xiwei Tang, University of Virginia

Optimal estimators and tests for reciprocal effects

The p1 model plays a fundamental role in modeling directed networks, where the reciprocal effect parameter ρ is of special interest in practice. However, due to nonlinear factors in this model, how to estimate ρ efficiently is a long-standing open problem. We tackle the problem with a cycle count approach. The challenge is that, due to the nonlinear factors in the model, for any given type of generalized cycle, the expected count is a complicated function of many parameters in the model, so it is unclear how to use cycle counts to estimate ρ. However, somewhat surprisingly, we discover that, among many types of generalized cycles with the same length, we can carefully pick a pair of them such that in the ratio between the expected cycle counts of the two types, the nonlinear factors cancel out nicely with each other, and as a result, the ratio equals exp(ρ) exactly. Therefore, though the expected count of cycles of any type is not tractable, the ratio between the expected cycle counts of a (carefully chosen) pair of generalized cycles may have an utterly simple form. We study to what extent such pairs exist, and use our discovery to derive both an estimate for ρ and a testing procedure for testing ρ = ρ0. In a setting where we allow a wide range of reciprocal effects and a wide variety of network sparsity and degree heterogeneity, we show that our estimator achieves the optimal rate and our test achieves the optimal phase transition.
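As a small illustration of how nuisance factors can cancel in a ratio, the sketch below checks the simplest case, a single dyad. It assumes the standard p1-type parametrization in which the joint weight of (A_ij, A_ji) = (a, b) is proportional to exp(a·mu_ij + b·mu_ji + ρ·a·b). The names `mu_ij`, `mu_ji`, `dyad_probs`, and `dyad_odds_ratio` are hypothetical. This is not the generalized cycle construction from the talk, which works with counts that can be observed in a single network.

```python
import math

def dyad_probs(mu_ij, mu_ji, rho):
    """Joint distribution of (A_ij, A_ji) for one dyad under a p1-type model.

    mu_ij, mu_ji: hypothetical edge-level log-odds terms (e.g. sender plus
    receiver effects); rho: the reciprocal effect.
    """
    weights = {
        (0, 0): 1.0,
        (1, 0): math.exp(mu_ij),
        (0, 1): math.exp(mu_ji),
        (1, 1): math.exp(mu_ij + mu_ji + rho),
    }
    total = sum(weights.values())  # normalizing constant, itself nonlinear in mu
    return {pattern: w / total for pattern, w in weights.items()}

def dyad_odds_ratio(p):
    """P(1,1) * P(0,0) / (P(1,0) * P(0,1)); mu terms and the normalizer cancel."""
    return p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)])

rho = 0.7
nuisance_values = [(-2.0, -1.0), (0.3, -3.5), (1.2, 0.4)]
ratios = [dyad_odds_ratio(dyad_probs(a, b, rho)) for a, b in nuisance_values]
```

Each entry of `ratios` matches exp(ρ) up to floating-point error regardless of the nuisance terms, which mirrors the kind of exact cancellation described above.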

Speaker

Jiashun Jin, Carnegie Mellon University

R-Squared From Synthetic Data: When Can It Be Trusted?

Synthetic data are gaining in popularity for a variety of reasons -- ranging from protecting privacy to reducing computation costs -- making it more urgent to address the question of how reliable synthetic data are for assessing the real-world efficacy of a prediction algorithm. This article outlines a comparative framework that takes into account (1) the relationship between the synthetic data D and the (potentially counterfactual) benchmark data D*, which is perceived as a reasonable representation of reality; and (2) the relationship between how the algorithm interacts with D and how it interacts with D*. We propose measures of target syntheticity (or more broadly proximity) and residual syntheticity/proximity, and provide a simple decomposition of the benchmark R-squared into the synthetic R-squared and a syntheticity-impact score, which quantifies the difference between the residual and target syntheticities relative to the residual syntheticity alone. We show that the synthetic R-squared is typically asymptotically conservative whenever the synthetic data are created by injecting additive noise into the target variable, such as in differential privacy, and we provide a computable adjustment for safely correcting the conservativeness in the synthetic R-squared in such cases. Additionally, we establish a necessary and sufficient condition for the residual syntheticity to exceed one, which implies a conservative synthetic R-squared when the target variable is not synthesized. We apply these theoretical insights to a proxy study, investigating the prediction of ground-level features from Earth observations in cases where the locations of these features have been synthetically perturbed to protect the data providers' privacy. (This is joint work with James Bailie, Mohammad Kakooei, and Adel Daoud of AI and Global Development Lab at Chalmers University of Technology in Sweden.)
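The conservativeness under additive target noise can be seen in a toy simulation. The sketch below is only an illustration under simple assumptions (one Gaussian predictor, ordinary least squares, in-sample R-squared, unit-variance noise) and does not implement the decomposition or the adjustment from the talk. The helper names `r_squared` and `ols_fit_predict` are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def ols_fit_predict(x, y):
    """Fit y on an intercept and x by least squares; return fitted values."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)

# Benchmark target: signal variance 1, noise variance 1, population R^2 = 1/2.
y_real = x + rng.normal(size=n)
# Synthetic target: extra unit-variance additive noise, population R^2 = 1/3.
y_syn = y_real + rng.normal(size=n)

r2_real = r_squared(y_real, ols_fit_predict(x, y_real))
r2_syn = r_squared(y_syn, ols_fit_predict(x, y_syn))
```

In this setup the extra noise inflates the total variance of the synthetic target without adding any predictable signal, so `r2_syn` falls below `r2_real`.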

Speaker

Xiao-Li Meng, Harvard University