Reinforcing data from controlled experiments with synthetic data to increase predictive accuracy

Lochana Palayangoda Co-Author
University of Nebraska Omaha
 
Jason Parcon First Author
PepsiCo
 
Jason Parcon Presenting Author
PepsiCo
 
Monday, Aug 4: 2:50 PM - 3:05 PM
2709 
Contributed Papers 
Music City Center 
Product development teams collect data from controlled experiments to optimize products or processes; for example, ingredient levels are tuned for maximum consumer appeal, or process settings are tuned for maximum yield. Physical experiments can be costly, so designs that require a minimal number of runs (e.g., D-optimal designs) are often used. Such designs, however, may not adequately cover certain regions of the input space, which can degrade a model's predictive performance. This paper therefore explores the use of synthetically generated data to reinforce real data and improve predictive performance. The synthetic data points are designed to provide better coverage of the input space while preserving key statistical properties of the original data. A specific use case with notable improvements in predictive performance will be presented: RMSE on a held-out test set decreased markedly for models trained on the combined real and synthetic data compared with models trained on real data alone. This approach allows a more comprehensive exploration of the input space without physically collecting more data.
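
The abstract does not specify how the synthetic points are generated, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a two-factor input space scaled to [0, 1], four hypothetical real runs at the corners (as a small optimal design might place them), Latin hypercube candidates, a greedy maximin rule to pick candidates in poorly covered regions, and an inverse-distance-weighted average of the real responses as a placeholder response generator. All function names and data values are illustrative.

```python
import math
import random

def latin_hypercube(n, bounds, rng):
    """Return n points with one point per stratum in each dimension."""
    columns = []
    for lo, hi in bounds:
        strata = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(strata)
        columns.append([lo + u * (hi - lo) for u in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]

def min_distance(point, others):
    """Euclidean distance from point to its nearest neighbor in others."""
    return min(math.dist(point, o) for o in others)

def select_gap_fillers(real_x, candidates, k):
    """Greedily pick k candidates farthest from all points covered so far."""
    chosen = []
    pool = list(candidates)
    covered = list(real_x)
    for _ in range(k):
        best = max(pool, key=lambda c: min_distance(c, covered))
        pool.remove(best)
        chosen.append(best)
        covered.append(best)
    return chosen

def idw_response(x, real_x, real_y, power=2.0):
    """Placeholder generator: inverse-distance-weighted mean of real responses."""
    weighted = []
    for xr, yr in zip(real_x, real_y):
        d = math.dist(x, xr)
        if d == 0.0:
            return yr  # exact match with a real run
        weighted.append((1.0 / d ** power, yr))
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total

def rmse(y_true, y_pred):
    """Root mean squared error, for scoring a model on a held-out test set."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

rng = random.Random(0)
bounds = [(0.0, 1.0), (0.0, 1.0)]

# Hypothetical real runs at the corners of the design region.
real_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
real_y = [1.0, 2.0, 2.0, 3.5]

candidates = latin_hypercube(50, bounds, rng)
synthetic_x = select_gap_fillers(real_x, candidates, k=5)
synthetic_y = [idw_response(x, real_x, real_y) for x in synthetic_x]

# Training set that reinforces the real runs with the synthetic ones.
combined_x = real_x + synthetic_x
combined_y = real_y + synthetic_y
```

In a comparison like the one described, one model would be fit to `real_x`, `real_y` and another to `combined_x`, `combined_y`, and `rmse` would be computed for each on the same held-out real observations. Whether the combined model scores better depends on the data and on how the synthetic responses are generated; this sketch makes no claim about the outcome.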

Keywords

Synthetic Data

Prediction Accuracy

Controlled Experiments 

Main Sponsor

Section on Statistical Consulting