Testing and Evaluating Foundation Models in High-Consequence Scenarios
Monday, Aug 4: 11:50 AM - 12:15 PM
Invited Paper Session
Music City Center
Foundation models have had a profound impact on society, through models such as OpenAI's ChatGPT and Anthropic's Claude series, as well as on science, through models such as AlphaFold, ClimaX, and Aurora. While these models can produce impressive output, far less attention has been paid to model evaluation than to model building. In this talk I will discuss some of the challenges that make testing and evaluating large models difficult, along with efforts to evaluate them more systematically, such as uncertainty quantification on metrics and predictions; holistic metrics that go beyond leaderboardism, where models are ranked and compared by a single value; and deterministic evaluation of an LLM's output probability distribution.
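As a small illustration of one idea mentioned above, uncertainty quantification on metrics: instead of reporting a single leaderboard number, a percentile bootstrap can attach a confidence interval to an evaluation score. This sketch is not from the talk; the function name and the 0/1 correctness data are hypothetical.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean metric
    (e.g. per-example accuracy), reported alongside the point estimate."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-example scores with replacement and collect the means.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical 0/1 correctness scores from a small eval set.
scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
point, (lo, hi) = bootstrap_ci(scores)
```

With only ten examples the interval is wide, which is exactly the point: two models whose single-number scores differ may be statistically indistinguishable once metric uncertainty is reported.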
Testing and evaluation
uncertainty quantification
AI models