Aligning Large Language Models with Consensus between Preference and Policy

Jiancong Xiao, Speaker
Department of Biostatistics and Epidemiology, University of Pennsylvania
 
Tuesday, Aug 6: 2:05 PM - 2:25 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
The rapid advancement of Large Language Models (LLMs) presents significant opportunities in the pursuit of artificial intelligence (AI), while simultaneously raising critical safety concerns that call for robust AI alignment strategies. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising approach for achieving AI alignment. However, a notable challenge with this method is its susceptibility to mode collapse, which reduces the diversity of model outputs. Our paper identifies a key source of this problem: the algorithmic bias inherent in RLHF. We find that this bias arises from a lack of consensus between the preference learning (step 2) and policy learning (step 3) stages of RLHF. To tackle this issue, we establish the necessary and sufficient conditions for achieving preference-policy consensus (PPC). We theoretically demonstrate that the global solutions of policy learning under PPC-RLHF are aligned with the preference. In doing so, this approach counteracts the algorithmic bias inherent in RLHF, yielding a more equitable and better-aligned large language model.
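
For orientation, the following is a minimal sketch of the two RLHF stages the abstract refers to, assuming the conventional Bradley-Terry reward-learning loss and KL-regularized policy objective; the notation (r, \pi_{\mathrm{ref}}, \beta) is illustrative and not taken from the abstract, and the paper's specific PPC conditions are not stated here.

```latex
% Background sketch of the standard RLHF pipeline (assumed notation, not the paper's PPC construction).
% Step 2: preference (reward) learning via the Bradley-Terry model on comparisons (x, y_w, y_l):
\[
\max_{r} \; \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\big( r(x, y_w) - r(x, y_l) \big) \right]
\]
% Step 3: policy learning with a KL penalty against a reference (e.g., SFT) policy \pi_{\mathrm{ref}}:
\[
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\left[ r(x, y) \right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]
```

The preference-policy consensus studied in the paper concerns when the solutions of these two stages agree with the underlying human preference; the precise conditions are given in the paper rather than in this abstract.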