Probably Approximately Correct Labels
Tuesday, Aug 5: 9:45 AM - 10:15 AM
Invited Paper Session
Music City Center
Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method to reduce this cost by using AI predictions where they are confident and collecting expert labels only where needed. Our procedure outputs a labeled dataset with a probably approximately correct (PAC) guarantee: with high probability, the labeling error is small. This approach enables rigorous, cost-effective dataset curation using modern AI models. We demonstrate the benefits of the methodology on text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
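The idea of trusting confident AI predictions and sending the rest to experts can be illustrated with a minimal Python sketch. This is not the authors' procedure, which the abstract does not specify. It is one simple way to obtain a guarantee of this type, under several assumptions: each item has a scalar confidence score, a hypothetical `ask_expert` callback returns the true label, and the candidate thresholds are fixed in advance. The sketch uses a random calibration sample, Hoeffding's inequality, and a union bound over the thresholds.

```python
import math
import random


def pac_label(confidences, ai_labels, ask_expert, epsilon, delta, m,
              thresholds, seed=0):
    """Illustrative sketch only (not the authors' algorithm).

    Returns (labels, chosen_threshold, num_expert_queries). With probability
    at least 1 - delta over the calibration sample, the fraction of wrong
    labels in `labels` is at most epsilon.
    """
    n = len(confidences)
    rng = random.Random(seed)

    # Expert-label a uniform calibration sample (without replacement).
    calib = rng.sample(range(n), m)
    expert = {i: ask_expert(i) for i in calib}

    # Hoeffding slack, with a union bound over all candidate thresholds.
    slack = math.sqrt(math.log(len(thresholds) / delta) / (2 * m))

    # Pick the smallest threshold whose upper confidence bound on the
    # dataset-wide error (AI label kept and wrong) is at most epsilon.
    chosen = None
    for t in sorted(thresholds):
        errors = sum(
            1 for i in calib
            if confidences[i] >= t and ai_labels[i] != expert[i]
        )
        if errors / m + slack <= epsilon:
            chosen = t
            break

    # Keep AI labels above the threshold; query the expert everywhere else.
    # If no threshold qualifies, every item is expert-labeled.
    labels = []
    for i in range(n):
        if i in expert:
            labels.append(expert[i])
        elif chosen is not None and confidences[i] >= chosen:
            labels.append(ai_labels[i])
        else:
            expert[i] = ask_expert(i)
            labels.append(expert[i])
    return labels, chosen, len(expert)
```

The bound used here is conservative, so a sharper procedure would typically need fewer expert labels for the same `epsilon` and `delta`.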
black-box machine learning
statistical inference