Aligning Large Language Models with Consensus between Preference and Policy

Jiancong Xiao, Speaker
Department of Biostatistics and Epidemiology, University of Pennsylvania
 
Tuesday, Aug 6: 2:05 PM - 2:25 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
The rapid advancement of Large Language Models (LLMs) presents significant opportunities in the pursuit of artificial intelligence (AI), while simultaneously raising critical safety concerns that call for robust AI alignment strategies. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising approach for achieving AI alignment. However, a notable challenge with this method is its susceptibility to mode collapse, which reduces the diversity of model outputs. Our paper identifies a key source of this problem: the algorithmic bias inherent in RLHF. We find that this bias arises from a lack of consensus between the preference learning (step 2) and policy learning (step 3) stages of RLHF. To tackle this issue, we establish the necessary and sufficient conditions for achieving preference-policy consensus (PPC). We theoretically demonstrate that the global solutions of policy learning under PPC-RLHF are aligned with the preference. In doing so, this approach counteracts the algorithmic bias inherent in RLHF, yielding a more equitable and better-aligned large language model.
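
For orientation, the following is a minimal sketch of the two RLHF stages the abstract refers to, assuming the conventional Bradley-Terry reward-learning loss and KL-regularized policy objective; the notation (r, \pi_{\mathrm{ref}}, \beta) is illustrative and not taken from the abstract, and the paper's specific PPC conditions are not stated here.

```latex
% Background sketch of the standard RLHF pipeline (assumed notation, not the paper's PPC construction).
% Step 2: preference (reward) learning via the Bradley-Terry model on comparisons (x, y_w, y_l):
\[
\max_{r} \; \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\big( r(x, y_w) - r(x, y_l) \big) \right]
\]
% Step 3: policy learning with a KL penalty against a reference (e.g., SFT) policy \pi_{\mathrm{ref}}:
\[
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\left[ r(x, y) \right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]
```

The preference-policy consensus studied in the paper concerns when the solutions of these two stages agree with the underlying human preference; the precise conditions are given in the paper rather than in this abstract.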