Evaluating LLM Applications Like a Statistician

Spencer Carter, Instructor
Travelers Insurance
 
Tuesday, Aug 5: 1:00 PM - 5:00 PM
CE_26 
Professional Development Course/CE 
Music City Center 
Room: CC-108 
The public and private sectors alike are putting great emphasis on creating Artificial Intelligence systems backed by Large Language Models (LLMs). Because LLMs generalize effectively with few-shot or zero-shot learning, it has become remarkably easy to build LLM-driven applications with little more than an API call. This ease has led some to assess these systems qualitatively on a handful of examples, the colloquial "vibe check". Yet these are non-deterministic models with an effectively infinite input space, for which evaluation is not trivial. This is a prime opportunity for statisticians to play key roles in the evaluation and measurement of these systems.

This course will introduce statisticians to the common metrics and frameworks used to evaluate LLM systems (e.g., BLEU, ROUGE, G-Eval, and RAGAS), using Retrieval Augmented Generation (RAG) as an application use case. We will evaluate a RAG system using these metrics, then measure the impact of changes to the system's prompt and configuration. We will finish with a statistical analysis comparing these metrics against human evaluators' assessments of our RAG system.
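As a flavor of that final analysis, a minimal sketch in Python is shown below; the data, column names, and choice of rank correlation are illustrative assumptions, not the course's actual materials or methodology.

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-response scores: an automated metric (e.g., a RAGAS-style
# faithfulness score) alongside a human evaluator's 1-5 rating.
scores = pd.DataFrame({
    "response_id": [1, 2, 3, 4, 5],
    "automated_metric": [0.91, 0.42, 0.77, 0.65, 0.88],
    "human_rating": [5, 2, 4, 3, 5],
})

# Rank correlation is a simple first check of whether the automated metric
# orders responses the same way the human evaluators do.
rho, p_value = spearmanr(scores["automated_metric"], scores["human_rating"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")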

Prior experience with LLMs is not needed—we will largely treat the models as black boxes, focusing instead on measuring the performance of the application holistically.