Statistical Measures for Evaluating Quality of LLM Performance in NLU Tasks

Grace Deng, Speaker
Google
 
Wednesday, Aug 6: 10:35 AM - 10:55 AM
Topic-Contributed Paper Session 
Music City Center 
Modern natural language processing (NLP) applications often rely on large language models (LLMs) to automate tasks that previously required human input. Given the high cost of obtaining ground-truth labels, LLMs have recently been used as proxies for human ratings (e.g., AutoRaters), which can take the form of labels, preferences, or feedback. However, it can be challenging to fully evaluate LLM performance in NLP or Natural Language Understanding (NLU) task settings. We investigate statistical measures of agreement and evaluate their potential for assessing the general quality of LLMs for text analysis and inference.
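
The abstract does not include code, but as a rough illustration of the kind of agreement analysis it describes, the sketch below compares hypothetical LLM (AutoRater) labels against human ratings using chance-corrected agreement statistics. The label and score arrays are invented placeholders, not data from the talk, and the specific measures shown (Cohen's kappa, weighted kappa) are common choices rather than the ones necessarily used by the authors.

```python
# Minimal sketch: agreement between human raters and an LLM AutoRater.
# All data below are hypothetical placeholders for illustration only.
from sklearn.metrics import cohen_kappa_score, accuracy_score

human_labels = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"]
llm_labels   = ["pos", "neg", "neu", "neu", "pos", "pos", "neu", "pos"]

# Raw accuracy can overstate quality when classes are imbalanced;
# Cohen's kappa corrects for agreement expected by chance.
print("accuracy:", accuracy_score(human_labels, llm_labels))
print("cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))

# For ordinal ratings (e.g., 1-5 preference scores), a weighted kappa
# penalizes large disagreements more than near-misses.
human_scores = [5, 4, 2, 3, 5, 1, 2, 4]
llm_scores   = [4, 4, 2, 3, 5, 2, 2, 5]
print("quadratic-weighted kappa:",
      cohen_kappa_score(human_scores, llm_scores, weights="quadratic"))
```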
