On the Use of Bandits and Low-Rank Factorization to Speed up LLM-based Evaluation

Speaker: Ruihan Wu
 
Tuesday, Aug 6: 2:25 PM - 2:45 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
Natural language generation has reached such a high level of proficiency that it has become very challenging to compare the performance of one language model to another. Because traditional metrics such as BLEU and ROUGE are too brittle, it is now common practice to depend, implicitly or explicitly, on another, often larger, language model to score and compare generations. Relying on large language models (LLMs) such as GPT-4 to score generations is incredibly costly in terms of money, compute, and time. We aim to reduce the burden of these evaluations with respect to all three of these resources. First, we observe that these evaluation matrices are intrinsically low rank and are well approximated by low-rank factorizations. Second, we build on the well-studied multi-armed bandit framework, proposing a range of algorithms for selecting the best language model, spanning from those with strong theoretical guarantees to those with empirically strong performance. We find our methods can typically identify the top performer with 5-15% of the resources typically required, that is, an 85-95% reduction in cost.
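The two ingredients above can be illustrated with a small numerical sketch. Everything below is an illustrative assumption, not the talk's actual algorithms: a synthetic models-by-prompts score matrix with approximate low-rank structure (checked via truncated SVD), and a simplified successive-elimination bandit with a fixed margin that identifies the top model while querying only a fraction of the entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a score matrix of n_models x n_prompts, where
# entry (i, j) is an LLM-judge score for model i on prompt j.
# Assume scores decompose as model quality + prompt difficulty + noise,
# which makes the matrix approximately rank 2.
n_models, n_prompts = 8, 200
quality = np.linspace(0.0, 2.0, n_models)    # per-model skill (assumed)
difficulty = rng.normal(size=n_prompts)      # per-prompt offset (assumed)
scores = quality[:, None] + difficulty[None, :] + 0.05 * rng.normal(
    size=(n_models, n_prompts)
)

# Low-rank check: a rank-2 truncated SVD reconstructs the matrix closely.
u, s, vt = np.linalg.svd(scores, full_matrices=False)
approx = (u[:, :2] * s[:2]) @ vt[:2]
rel_err = np.linalg.norm(scores - approx) / np.linalg.norm(scores)

# Simplified successive elimination: evaluate all surviving models on the
# same growing random subset of prompts, and drop any model whose sample
# mean trails the leader by a fixed margin (real bandit algorithms would
# use confidence-interval widths instead of a fixed margin).
active = list(range(n_models))
queried = 0
sample_size, margin = 10, 0.2
while len(active) > 1 and sample_size <= n_prompts:
    cols = rng.choice(n_prompts, size=sample_size, replace=False)
    means = {m: scores[m, cols].mean() for m in active}
    queried += len(active) * sample_size
    leader = max(means.values())
    active = [m for m in active if means[m] >= leader - margin]
    sample_size *= 2

true_best = int(scores.mean(axis=1).argmax())
fraction_used = queried / (n_models * n_prompts)
print(f"rank-2 relative error: {rel_err:.3f}")
print(f"surviving models: {active}, true best: {true_best}")
print(f"fraction of full evaluation grid queried: {fraction_used:.2f}")
```

Sampling the same prompt subset for every surviving model makes the per-prompt difficulty cancel when comparing models, which is why a small sample already separates the top performer in this toy setting.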