Quantifying and Correcting for Model Space Bias from AI-Synthesized Data in Streaming Data

Kentaro Hoffman Co-Author
University of Washington Department of Statistics
 
Tyler McCormick Co-Author
University of Washington
 
Kentaro Hoffman Speaker
University of Washington Department of Statistics
 
Wednesday, Aug 6: 9:25 AM - 9:50 AM
Invited Paper Session 
Music City Center 
Hoffman et al. (2024) investigate how the inclusion of synthetic AI- or ML-generated data can bias the space of feasible models, potentially leading to erroneous downstream decision-making. This work demonstrates how to quantify and correct for this bias by including small amounts of real data together with a correction factor from the framework of Inference on Predicted Data (IPD). With this procedure, we demonstrate how to obtain valid statistical inference in the context of streaming data, even when much of the data is machine-biased. Furthermore, Bayesian optimal experimental design is leveraged to define the optimal sample sizes of real and synthetic data to best control the space of feasible models.
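
The abstract does not say which IPD correction factor is used. As an illustration only, the sketch below applies one published IPD-style correction, the prediction-powered mean estimator, in which a small real sample estimates the average prediction error and that estimate is subtracted from the mean of the synthetic predictions. The function name `ppi_mean`, the simulated data, the 0.5 upward bias, and the sample sizes are all assumptions made for this example; this is not the authors' procedure and it does not cover the streaming or experimental design components.

```python
import numpy as np

def ppi_mean(y_real, f_real, f_synth, z=1.96):
    """Bias-corrected mean from predictions plus a small labeled sample.

    y_real  : observed outcomes on the small real sample
    f_real  : model predictions on that same real sample
    f_synth : model predictions where no real outcome is available
    Returns the corrected estimate and a normal-approximation interval.
    """
    y_real = np.asarray(y_real, dtype=float)
    f_real = np.asarray(f_real, dtype=float)
    f_synth = np.asarray(f_synth, dtype=float)
    n, N = len(y_real), len(f_synth)

    # Correction factor: average gap between truth and prediction on real data.
    residual = y_real - f_real
    estimate = f_synth.mean() + residual.mean()

    # The two samples are independent, so their variances add.
    se = np.sqrt(f_synth.var(ddof=1) / N + residual.var(ddof=1) / n)
    return estimate, (estimate - z * se, estimate + z * se)

# Hypothetical data: true mean 2.0, predictions biased upward by 0.5.
rng = np.random.default_rng(0)
truth = rng.normal(loc=2.0, scale=1.0, size=10_200)
pred = truth + 0.5 + rng.normal(scale=0.3, size=truth.size)

# Only the first 200 outcomes are treated as observed (real) data.
y_real, f_real, f_synth = truth[:200], pred[:200], pred[200:]

naive = f_synth.mean()                      # uses synthetic data alone
estimate, (lo, hi) = ppi_mean(y_real, f_real, f_synth)
print(f"naive={naive:.3f} corrected={estimate:.3f} interval=({lo:.3f}, {hi:.3f})")
```

In this toy setting the naive mean inherits the model's bias, while the corrected estimate is centered near the true mean at the cost of a wider interval driven by the small real sample.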

Keywords

Artificial Intelligence

Inference on Predicted Data

Statistical Inference

Streaming Data

Bayesian Optimal Experimental Design