Reinforcing data from controlled experiments with synthetic data to increase predictive accuracy

Presented During: Bridging Statistical Theory and Practice: Tools and Techniques for Effective Consulting and Collaboration

Lochana Palayangoda Co-Author
University of Nebraska Omaha

Jason Parcon First Author
PepsiCo

Jason Parcon Presenting Author
PepsiCo

Monday, Aug 4: 2:50 PM - 3:05 PM
2709
Contributed Papers

Music City Center

Product development teams collect data from controlled experiments to optimize products or processes; for example, ingredient levels are optimized for maximum consumer appeal, or process settings are optimized for maximum yield. Physical experiments can be costly, so designs that require a minimal number of runs (e.g., D-optimal designs) are often used. Such designs, however, may not adequately cover certain parts of the input space, which may impair a model's predictive performance. To address this, this paper explores the use of synthetically generated data to reinforce real data and enhance predictive performance. The synthetic data points are designed to provide better coverage of the input space while preserving key statistical properties of the original data. A specific use case that showed notable improvements in predictive performance will be presented: RMSE on a held-out test set decreased markedly for models trained on the combined real and synthetic data compared with models trained on real data alone. This approach allows for a more comprehensive exploration of the input space without the need to physically collect more data.
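
The abstract does not say how the synthetic points are placed or how their responses are generated, so the following is only an illustrative sketch of the coverage idea, not the authors' method. It assumes a hypothetical two-factor design in coded units, a random candidate set, and a greedy maximin rule (each new input location is the candidate farthest from all points chosen so far). The names `min_dist_to_design`, `greedy_fill`, and `real_design` are invented for illustration. Generating responses for the new points is deliberately left out, since the abstract gives no detail on that step.

```python
import numpy as np

def min_dist_to_design(candidates, design):
    """Euclidean distance from each candidate to its nearest design point."""
    diffs = candidates[:, None, :] - design[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def greedy_fill(design, candidates, n_new):
    """Greedy maximin: repeatedly pick the candidate farthest from the current design."""
    augmented = design.copy()
    chosen = []
    for _ in range(n_new):
        d = min_dist_to_design(candidates, augmented)
        idx = int(np.argmax(d))
        chosen.append(candidates[idx])
        augmented = np.vstack([augmented, candidates[idx]])
    return np.array(chosen)

# Hypothetical small design in coded units [-1, 1] (assumption, not from the abstract).
real_design = np.array(
    [[-1, -1], [-1, 1], [1, -1], [1, 1], [0, 1], [-1, 0]], dtype=float
)

# Random candidate locations standing in for the full input space.
rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(2000, 2))

# Input locations for the synthetic points (responses are not generated here).
synthetic_inputs = greedy_fill(real_design, candidates, n_new=5)

# Fill distance: the largest gap between any candidate and its nearest design point.
fill_before = min_dist_to_design(candidates, real_design).max()
fill_after = min_dist_to_design(
    candidates, np.vstack([real_design, synthetic_inputs])
).max()
print(f"fill distance before: {fill_before:.3f}, after: {fill_after:.3f}")
```

A smaller fill distance after augmentation means fewer large unexplored gaps in the input space, which is the kind of coverage improvement the abstract refers to. Whether that translates into lower held-out RMSE depends on how the synthetic responses are produced.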

Keywords

Synthetic Data

Prediction Accuracy

Controlled Experiments

Main Sponsor

Section on Statistical Consulting