On the Use of Bandits and Low-Rank Factorization to Speed up LLM-based Evaluation

Speaker: Ruihan Wu
 
Tuesday, Aug 6: 2:25 PM - 2:45 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
Natural language generation has reached such a high level of proficiency that it has become very challenging to compare the performance of one language model to another. Because traditional metrics such as BLEU and ROUGE are too brittle, it is now common practice to depend, implicitly or explicitly, on another, often larger, language model to score and compare generations. Relying on large language models (LLMs) such as GPT-4 to score generations is incredibly costly in terms of money, compute, and time. We aim to reduce the burden of these evaluations with respect to all three of these resources. First, we observe that these evaluation matrices are intrinsically low rank and are well approximated by low-rank factorizations. Second, we build on the well-studied multi-armed bandit framework, proposing a range of algorithms for selecting the best language model, spanning from those with strong theoretical guarantees to those with empirically strong performance. We find our methods can typically identify the top performer with 5-15% of the resources typically required, that is, an 85-95% reduction in cost.
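The two ingredients above can be illustrated with a small numerical sketch. Everything below is an illustrative assumption, not the talk's actual algorithms: a synthetic models-by-prompts score matrix with approximate low-rank structure (checked via truncated SVD), and a simplified successive-elimination bandit with a fixed margin that identifies the top model while querying only a fraction of the entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a score matrix of n_models x n_prompts, where
# entry (i, j) is an LLM-judge score for model i on prompt j.
# Assume scores decompose as model quality + prompt difficulty + noise,
# which makes the matrix approximately rank 2.
n_models, n_prompts = 8, 200
quality = np.linspace(0.0, 2.0, n_models)    # per-model skill (assumed)
difficulty = rng.normal(size=n_prompts)      # per-prompt offset (assumed)
scores = quality[:, None] + difficulty[None, :] + 0.05 * rng.normal(
    size=(n_models, n_prompts)
)

# Low-rank check: a rank-2 truncated SVD reconstructs the matrix closely.
u, s, vt = np.linalg.svd(scores, full_matrices=False)
approx = (u[:, :2] * s[:2]) @ vt[:2]
rel_err = np.linalg.norm(scores - approx) / np.linalg.norm(scores)

# Simplified successive elimination: evaluate all surviving models on the
# same growing random subset of prompts, and drop any model whose sample
# mean trails the leader by a fixed margin (real bandit algorithms would
# use confidence-interval widths instead of a fixed margin).
active = list(range(n_models))
queried = 0
sample_size, margin = 10, 0.2
while len(active) > 1 and sample_size <= n_prompts:
    cols = rng.choice(n_prompts, size=sample_size, replace=False)
    means = {m: scores[m, cols].mean() for m in active}
    queried += len(active) * sample_size
    leader = max(means.values())
    active = [m for m in active if means[m] >= leader - margin]
    sample_size *= 2

true_best = int(scores.mean(axis=1).argmax())
fraction_used = queried / (n_models * n_prompts)
print(f"rank-2 relative error: {rel_err:.3f}")
print(f"surviving models: {active}, true best: {true_best}")
print(f"fraction of full evaluation grid queried: {fraction_used:.2f}")
```

Sampling the same prompt subset for every surviving model makes the per-prompt difficulty cancel when comparing models, which is why a small sample already separates the top performer in this toy setting.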