
Quantifying and Correcting for Model Space Bias from AI-Synthesized Data in Streaming Data

Presented During: When Many Models Are Equally Accurate: Exploring Interpretability and Uncertainty Quantification

Kentaro Hoffman Co-Author
University of Washington Department of Statistics

Tyler McCormick Co-Author
University of Washington

Kentaro Hoffman Speaker
University of Washington Department of Statistics

Wednesday, Aug 6: 9:25 AM - 9:50 AM
Invited Paper Session

Music City Center

Hoffman et al. (2024) investigate how including synthetic, AI- or ML-generated data can bias the space of feasible models, potentially leading to erroneous downstream decision-making. This work demonstrates how to quantify and correct for this bias by including small amounts of real data together with a correction factor from the framework of Inference on Predicted Data (IPD). With this procedure, we demonstrate how to obtain valid statistical inference in the context of streaming data even when much of the data is machine-biased. Furthermore, Bayesian optimal experimental design is leveraged to determine the sample sizes of real and synthetic data that best control the space of feasible models.
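As a rough illustration of the idea of correcting synthetic data with a small real sample, the sketch below estimates a population mean. It is not the procedure from the talk: it assumes a simple additive correction in the style of prediction-powered inference, in which the average gap between real and synthetic values on a small paired sample is added to the mean of a large synthetic sample. The function name `corrected_mean`, the simulated bias of 0.5, and the sample sizes are all hypothetical choices made for this example.

```python
import random
import statistics


def corrected_mean(synthetic_large, synthetic_paired, real_paired):
    """Mean of a large synthetic sample plus an additive correction.

    The correction is the average (real - synthetic) gap measured on a
    small paired sample. This is an assumed, simplified stand-in for an
    IPD-style correction factor, not the method from the abstract.
    """
    if len(synthetic_paired) != len(real_paired):
        raise ValueError("paired samples must have the same length")
    naive = statistics.fmean(synthetic_large)
    gaps = [r - s for r, s in zip(real_paired, synthetic_paired)]
    return naive + statistics.fmean(gaps)


# Simulated setting (all values are assumptions for illustration):
# the true mean is 2.0 and the synthetic generator overshoots by 0.5.
rng = random.Random(0)
TRUE_MEAN, BIAS = 2.0, 0.5


def synthesize(real_value):
    # Hypothetical biased generator: shifts the real value and adds noise.
    return real_value + BIAS + rng.gauss(0.0, 0.2)


# Small paired sample: real observations and their synthetic counterparts.
real_paired = [rng.gauss(TRUE_MEAN, 1.0) for _ in range(200)]
synthetic_paired = [synthesize(r) for r in real_paired]

# Large synthetic-only sample, as might accumulate in a data stream.
synthetic_large = [synthesize(rng.gauss(TRUE_MEAN, 1.0)) for _ in range(20000)]

naive = statistics.fmean(synthetic_large)
corrected = corrected_mean(synthetic_large, synthetic_paired, real_paired)
print(f"naive synthetic-only estimate: {naive:.3f}")
print(f"corrected estimate:            {corrected:.3f}")
```

Under these assumptions the synthetic-only estimate sits near the biased value of 2.5, while the corrected estimate moves back toward the true mean of 2.0. In a streaming setting the two averages could be maintained as running means and updated as each batch arrives, though that is not shown here.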

Keywords

Artificial Intelligence

Inference on Predicted Data

Statistical Inference

Streaming Data

Bayesian Optimal Experimental Design