A Statistical Method for Safety Alignment of LLMs

Radha Poovendran, Co-Author
University of Washington
 
Zhangchen Xu, Speaker
University of Washington
 
Tuesday, Aug 6: 9:50 AM - 10:15 AM
Invited Paper Session 
Oregon Convention Center 
As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant security threat to LLM deployment.

This talk introduces a statistical method to ensure the safety alignment of LLMs. We observe that safe and unsafe behaviors exhibited by LLMs differ in the probability distributions of tokens: an unsafe response corresponds to a distribution in which the probabilities of tokens representing harmful content outweigh those of tokens representing harmless responses. We leverage this observation to develop a lightweight safety-aware decoding strategy, SafeDecoding, for safety alignment. SafeDecoding mitigates jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences aligned with the attacker's objectives, guided by the observed shift in token distributions. We perform extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. This work is supported by the NSF AI Institute for Agent-based Cyber Threat Intelligence and Operation (ACTION).
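
To make the idea of a safety-aware distribution shift concrete, below is a minimal sketch of the general mechanism the abstract describes: reweighting the next-token distribution so that tokens favored by a safety-aware reference (e.g., refusal or disclaimer tokens) are amplified and tokens aligned with an attacker's objective are attenuated. The reference distribution, the `alpha` strength parameter, the `top_k` candidate set, and the function name are illustrative assumptions, not the exact formulation used in the SafeDecoding paper.

```python
import torch

def safety_aware_redistribution(p_base: torch.Tensor,
                                p_safe: torch.Tensor,
                                alpha: float = 0.8,
                                top_k: int = 10) -> torch.Tensor:
    """Shift a next-token distribution toward safety-aligned tokens.

    p_base : next-token probabilities from the original model, shape (vocab_size,)
    p_safe : next-token probabilities from a safety-aware reference
             (e.g., a model tuned on refusal/disclaimer data), shape (vocab_size,)
    alpha  : strength of the shift toward the safety-aware distribution
    top_k  : restrict the shift to the base model's top-k candidate tokens
    """
    # Candidate set: the base model's most likely next tokens.
    topk_idx = torch.topk(p_base, k=top_k).indices

    # Move probability mass toward the safety-aware reference:
    # tokens the reference favors (disclaimers, refusals) are amplified,
    # tokens it disfavors (attack-aligned continuations) are attenuated.
    shifted = p_base.clone()
    shifted[topk_idx] = p_base[topk_idx] + alpha * (p_safe[topk_idx] - p_base[topk_idx])

    # Clamp and renormalize so the result is a valid probability distribution.
    shifted = torch.clamp(shifted, min=0.0)
    return shifted / shifted.sum()


# Toy usage: a 6-token vocabulary where token 0 is a refusal token
# and token 5 is an attack-aligned token.
p_base = torch.tensor([0.10, 0.15, 0.15, 0.10, 0.10, 0.40])
p_safe = torch.tensor([0.60, 0.10, 0.10, 0.08, 0.07, 0.05])
print(safety_aware_redistribution(p_base, p_safe, alpha=0.8, top_k=6))
```

In this toy example, the refusal token's probability rises (0.10 to roughly 0.50 before renormalization) while the attack-aligned token's falls (0.40 to roughly 0.12), illustrating how decoding-time reweighting can steer generation toward a safety disclaimer without retraining the underlying model.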