Statistical Measures for Evaluating Quality of LLM Performance in NLU Tasks

Grace Deng, Speaker
Google
 
Wednesday, Aug 6: 10:35 AM - 10:55 AM
Topic-Contributed Paper Session 
Music City Center 
Modern natural language processing (NLP) applications often rely on large language models (LLMs) to automate tasks that previously required human input. Given the high cost of obtaining ground-truth labels, LLMs have recently been used as proxies for human ratings (e.g., AutoRaters), which can take the form of labels, preferences, or feedback. However, it can be challenging to fully evaluate LLM performance in NLP or Natural Language Understanding (NLU) task settings. We investigate statistical measures of agreement and evaluate their potential for assessing the general quality of LLMs for text analysis and inference.
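
The abstract does not include code, but as a rough illustration of the kind of agreement analysis it describes, the sketch below compares hypothetical LLM (AutoRater) labels against human ratings using chance-corrected agreement statistics. The label and score arrays are invented placeholders, not data from the talk, and the specific measures shown (Cohen's kappa, weighted kappa) are common choices rather than the ones necessarily used by the authors.

```python
# Minimal sketch: agreement between human raters and an LLM AutoRater.
# All data below are hypothetical placeholders for illustration only.
from sklearn.metrics import cohen_kappa_score, accuracy_score

human_labels = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"]
llm_labels   = ["pos", "neg", "neu", "neu", "pos", "pos", "neu", "pos"]

# Raw accuracy can overstate quality when classes are imbalanced;
# Cohen's kappa corrects for agreement expected by chance.
print("accuracy:", accuracy_score(human_labels, llm_labels))
print("cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))

# For ordinal ratings (e.g., 1-5 preference scores), a weighted kappa
# penalizes large disagreements more than near-misses.
human_scores = [5, 4, 2, 3, 5, 1, 2, 4]
llm_scores   = [4, 4, 2, 3, 5, 2, 2, 5]
print("quadratic-weighted kappa:",
      cohen_kappa_score(human_scores, llm_scores, weights="quadratic"))
```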
