Distribution-aware Pruning Strategy for Large Language Models: from Unstructured to Structured Pruning

Speaker: Qi Lei
New York University
 
Tuesday, Aug 6, 2:45 PM - 3:05 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
Recent progress toward artificial general intelligence has produced large language models (LLMs) with billions of parameters. At this scale, unnecessary neurons or weights must be removed through model pruning. Traditional pruning methods typically rank weights by magnitude in a deterministic manner. However, weight magnitude is a local metric that ignores how each weight affects the model's output globally, and deterministic pruning can introduce errors that accumulate across layers. Randomized pruning, by contrast, can help even out these errors across layers. In this talk, we introduce two inference-aware pruning criteria derived from an optimization perspective on output approximation, which surpass traditional training-aware metrics such as gradient- and Hessian-based scores. Moreover, we introduce a two-step reconstruction technique that mitigates pruning errors without retraining the model. Our experiments demonstrate the superior performance of this approach across various datasets and models, markedly reducing both computational cost and hardware requirements.
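
To make the contrast concrete, the minimal sketch below compares deterministic magnitude pruning with a randomized, output-aware criterion on a single linear layer. The |w|·||x|| score computed from calibration activations and the sampling scheme are illustrative assumptions for exposition, not the exact criteria developed in the talk.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Deterministic baseline: zero out the smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def randomized_output_aware_prune(weight: torch.Tensor,
                                  act_norm: torch.Tensor,
                                  sparsity: float) -> torch.Tensor:
    """Illustrative output-aware variant: score each weight by its estimated
    effect on the layer output (|w_ij| * ||x_j||, with ||x_j|| taken from
    calibration activations), then sample the kept set with inclusion
    chances driven by the scores so pruning errors are spread out rather
    than concentrated in particular rows or layers."""
    scores = weight.abs() * act_norm.unsqueeze(0)      # shape (out, in)
    keep = weight.numel() - int(weight.numel() * sparsity)
    idx = torch.multinomial(scores.flatten(), keep, replacement=False)
    mask = torch.zeros(weight.numel(), device=weight.device)
    mask[idx] = 1.0
    return weight * mask.view_as(weight)

# Toy usage on one linear layer with a calibration batch (hypothetical sizes).
layer = torch.nn.Linear(64, 32, bias=False)
calib_x = torch.randn(128, 64)                          # calibration inputs
act_norm = calib_x.norm(dim=0)                          # ||x_j|| per input feature
pruned = randomized_output_aware_prune(layer.weight.data, act_norm, sparsity=0.5)
```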