Heterogeneity-Aware Synthetic Data Generation for Tabular Data

Xiwei Tang, Speaker
University of Virginia
 
Monday, Aug 4: 11:00 AM - 11:25 AM
Invited Paper Session 
Music City Center 
Recent advances in generative AI, such as diffusion models, have revolutionized data generation across various domains, including computational biology, medical research, the social sciences, and beyond. Yet, while image and text synthesis have seen remarkable progress, generating realistic tabular data, a common format in statistical applications, remains a significant challenge. Tabular datasets often involve mixed-type features (continuous, categorical, ordinal), complex inter-feature dependencies, and pronounced heterogeneity across individuals or subpopulations, making classical generative models ill-suited for statistical inference tasks. We introduce a novel diffusion-based framework designed specifically for synthetic tabular data generation, which incorporates feature-adaptive diffusion dynamics and subgroup-aware conditioning, explicitly addressing heterogeneity at both the variable and population levels. This enables our model to better capture local structure and dependence patterns essential for downstream statistical tasks. Through empirical studies, we demonstrate the model's strong performance across a variety of benchmarks and its value in applications such as missing data imputation, data augmentation, and downstream inference. The framework offers a promising pathway for bridging modern generative AI with classical statistical needs, given its ability to serve as an anonymized proxy for real datasets and to power effective learning on downstream tasks.