Creating Something from Nothing? Synthetic Data Analysis for Social Good and Policy Making

Emily Hector, Chair
North Carolina State University
 
Dungang Liu, Organizer
University of Cincinnati
 
Sunday, Aug 3: 2:00 PM - 3:50 PM
0320 
Invited Paper Session 
Music City Center 
Room: CC-201A 

Applied

Yes

Main Sponsor

Social Statistics Section

Co-Sponsors

Government Statistics Section
Health Policy Statistics Section

Presentations

Golden Ratio Weighting Prevents Model Collapse

In recent years, synthetic data have been widely used to train generative models such as large language models. This trend is driven mainly by the limited supply of data available for training ever-larger models, whose data needs grow under neural scaling laws. However, over successive training iterations, the trained generative models gradually lose information about the real data distribution, a phenomenon known as model collapse.

In this talk, we investigate this phenomenon theoretically by training generative models iteratively on a combination of newly collected real data and synthetic data from the previous training step. We conduct our theoretical studies in various scenarios, including Gaussian distribution estimation and linear regression. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression. Notably, in some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio.
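
To make the golden-ratio claim concrete, here is a minimal sketch under a toy setup that is an illustrative assumption, not necessarily the setting analyzed in the talk: Gaussian mean estimation with known variance, where each round combines the mean of n fresh real samples (weight w) with the mean of n synthetic samples drawn from the previous round's fitted model (weight 1 - w). Writing s = sigma^2 / n, the variance of the estimator then satisfies v_t = w^2 s + (1 - w)^2 (v_{t-1} + s), which converges to s (w^2 + (1 - w)^2) / (1 - (1 - w)^2). The function names below are hypothetical. The code minimizes this limiting variance over a grid of weights and compares the minimizer with the reciprocal of the golden ratio.

```python
import math

def stationary_variance(w, s=1.0):
    """Limiting variance of the weighted estimator in the assumed toy model.

    w is the weight on the real-data mean; s = sigma^2 / n is the variance
    of a single sample mean (equal real and synthetic sample sizes assumed).
    """
    return s * (w**2 + (1 - w)**2) / (1 - (1 - w)**2)

def best_weight(grid_size=100_000):
    """Grid search for the weight in (0, 1] that minimizes the limiting variance."""
    grid = [(i + 1) / grid_size for i in range(grid_size)]
    return min(grid, key=stationary_variance)

# Reciprocal of the golden ratio: 1 / phi = (sqrt(5) - 1) / 2
INV_GOLDEN = (math.sqrt(5) - 1) / 2

w_star = best_weight()
print(f"grid minimizer: {w_star:.4f}, 1/phi: {INV_GOLDEN:.4f}")
```

In this toy model, setting the derivative of the limiting variance to zero gives w^2 + w - 1 = 0, whose positive root is the reciprocal of the golden ratio. The result relies on the equal-sample-size assumption made here.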
 

Co-Author

Guang Cheng, University of California, Los Angeles

Speaker

Hengzhi He

Creating, evaluating, and sharing synthetic data for multinational HIV cohorts

Open science often clashes with data privacy laws and regulations, especially when it comes to sharing health data. Synthetic data offers a viable middle ground: it allows researchers to share data that resemble the original data while mitigating privacy concerns. We present our experience generating synthetic datasets for the Caribbean, Central and South America network for HIV epidemiology (CCASAnet), a large (n ≈ 70,000) observational cohort of people living with HIV throughout Latin America. We describe various methods for fitting and generating data, including generative adversarial network (GAN) and diffusion probabilistic model techniques. We discuss challenges encountered, including handling missing data and rare events. We evaluate the utility of our synthetic data by assessing its extrinsic performance, that is, its ability to yield results similar to those from the original data under analyses that are independent of the data generation process.
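
As a rough illustration of what an extrinsic utility check could look like, the sketch below runs the same downstream analysis (a simple least-squares slope) on an original and a synthetic dataset and reports how far apart the estimates are. Everything here is an assumption made for illustration: the function names are hypothetical, the numbers are made-up toy values rather than CCASAnet data, and the analyses used in the talk may differ.

```python
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slope_gap(original, synthetic):
    """Absolute difference between slopes estimated on the two datasets."""
    return abs(ols_slope(*original) - ols_slope(*synthetic))

# Toy (x, y) pairs, invented for this example only.
original = ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
synthetic = ([1, 2, 3, 4, 5], [2.3, 4.0, 6.1, 7.9, 9.6])

gap = slope_gap(original, synthetic)
print(f"slope gap between original and synthetic data: {gap:.2f}")
```

A small gap suggests that the synthetic data would lead an analyst to a similar conclusion for this particular analysis; it says nothing about privacy or about other analyses.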

 

Speaker

Bryan Shepherd, Vanderbilt University, School of Medicine

An "i-Mobility" framework for studying social mobility: individualized inference via generative analysis of discrete data

Social mobility refers to the ability of individuals or groups to move within a social hierarchy. Interest in social mobility has grown over the past decades due to rising concerns over educational disparities and the intergenerational persistence of poverty. The existing literature primarily investigates this issue by focusing on coarse demographic groups (e.g., race, gender, or country), which may overlook important nuances in individual characteristics. In this work, we establish an "i-Mobility" framework that allows us to study social mobility at the individual level. Specifically, given a predefined profile consisting of a combination of individual characteristics, our framework provides a measure that reflects the mobility of this "individual". An analysis of the well-regarded General Social Survey (GSS) shows that our framework is more robust than traditional group-focused methods for studying social mobility. Moreover, our framework can capture heterogeneity at the individual level and distinguish between different profiles by considering more nuanced personal characteristics.

Co-Author(s)

Jiawei Huang, Carl H. Lindner College of Business, University of Cincinnati
Yuan Jiang, Oregon State University
Yu Xie, Princeton University

Speaker

Dungang Liu, University of Cincinnati