The Annals of Applied Statistics Invited Session

Ji Zhu Chair
University of Michigan
 
Bradley Efron Discussant
Stanford University
 
Xihong Lin Discussant
Harvard T.H. Chan School of Public Health
 
Susan Murphy Discussant
Harvard University
 
Wing-Hung Wong Discussant
Stanford University
 
Bin Yu Discussant
University of California at Berkeley
 
Hongyu Zhao Discussant
Yale University
 
Ji Zhu Organizer
University of Michigan
 
Monday, Aug 4: 10:30 AM - 12:20 PM
0428 
Invited Paper Session 
Music City Center 
Room: CC-Davidson Ballroom B 

Applied: Yes

Main Sponsor: IMS

Presentations

Boosting Data Analytics with Synthetic Volume Expansion

Synthetic data generation heralds a paradigm shift in data science, addressing the challenges of data scarcity and privacy and enabling unprecedented performance. As synthetic data gains prominence, questions arise about how accurate statistical methods are when applied to synthetic data rather than to raw data alone. To address this, we introduce the Synthetic Data Generation for Analytics framework, which applies statistical methods to high-fidelity synthetic data produced by advanced generative models, such as tabular diffusion models, through knowledge transfer. These models, trained on raw data, are enriched with insights from relevant studies. A significant finding within this framework is the generational effect: the error of a statistical method initially decreases as synthetic data is added but may subsequently increase. This phenomenon, rooted in the difficulty of replicating raw data distributions, gives rise to the "reflection point," an optimal threshold for the amount of synthetic data, defined by specific error metrics. Through one data example, we demonstrate the effectiveness of this framework.
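
The idea of a reflection point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's method: the helper name `reflection_point` is hypothetical, the error values are made up, and the reflection point is taken to be the synthetic sample size that minimizes an error metric evaluated on a grid of candidate sizes.

```python
from typing import Sequence, Tuple


def reflection_point(sizes: Sequence[int], errors: Sequence[float]) -> Tuple[int, float]:
    """Return the synthetic sample size with the smallest estimated error.

    Hypothetical helper: `sizes` are candidate synthetic sample sizes and
    `errors` are the matching values of some chosen error metric.
    """
    if len(sizes) != len(errors) or len(sizes) == 0:
        raise ValueError("sizes and errors must be non-empty and of equal length")
    # Index of the smallest error; ties resolve to the smallest index.
    best = min(range(len(sizes)), key=lambda i: errors[i])
    return sizes[best], errors[best]


# Made-up error curve for illustration only (not results from the talk):
# the error falls as synthetic data is added, then rises again.
sizes = [100, 1_000, 10_000, 100_000]
errors = [0.30, 0.18, 0.12, 0.15]

m_star, err_star = reflection_point(sizes, errors)
print(f"reflection point: m = {m_star}, error = {err_star}")
```

In practice the error values would come from evaluating the statistical method on held-out raw data at each synthetic sample size; how that evaluation is done is not specified in the abstract.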

 

Keywords

Generative Machine Intelligence

Large Language Models

Knowledge Transfer

Pretrained Transformers

Tabular Diffusion

Unstructured 

Speaker

Xiaotong Shen, University of Minnesota