Multi-Teacher Bayesian Knowledge Distillation

Yongkai Chen (Co-Author)
Ping Ma (Co-Author), University of Georgia
Wenxuan Zhong (Co-Author), University of Georgia
Luyang Fang (First Author, Presenting Author), University of Georgia
Monday, Aug 4: 9:50 AM - 10:05 AM
1907 
Contributed Papers 
Music City Center 
Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios that require diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), in which a distilled student model learns from multiple teachers within a Bayesian framework. Our approach leverages Bayesian inference to capture the uncertainty inherent in the distillation process. We propose a teacher-informed prior that integrates external knowledge from the teacher models and the training data, offering better generalization, robustness, and scalability. In addition, an entropy-based weighting mechanism adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances interpretability, improves predictive accuracy, and provides principled uncertainty quantification. Our experiments demonstrate improved performance and robust uncertainty quantification, highlighting the strengths of MT-BKD.
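To make the entropy-based weighting idea concrete, here is a minimal sketch in plain NumPy, not the authors' implementation: each teacher's softened predictions are weighted inversely to their predictive entropy, and the weighted mixture serves as a soft target for the student. The temperature value, the exponential weighting rule, and the toy logits are illustrative assumptions only.

import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable form).
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_weights(teacher_probs, eps=1e-12):
    # Weight each teacher inversely to its predictive entropy:
    # lower entropy (a more confident teacher) yields a larger weight.
    # teacher_probs has shape (n_teachers, n_classes).
    entropies = -(teacher_probs * np.log(teacher_probs + eps)).sum(axis=-1)
    raw = np.exp(-entropies)
    return raw / raw.sum()

# Toy example: three teachers scoring one input over four classes
# (hypothetical logits, for illustration only).
teacher_logits = np.array([
    [4.0, 0.5, 0.2, 0.1],   # confident teacher
    [1.0, 0.9, 0.8, 0.7],   # near-uniform, uncertain teacher
    [3.0, 2.5, 0.1, 0.0],   # moderately confident teacher
])
probs = softmax(teacher_logits, temperature=2.0)
weights = entropy_weights(probs)
soft_target = (weights[:, None] * probs).sum(axis=0)  # combined soft label for the student
print("teacher weights:", np.round(weights, 3))
print("combined soft target:", np.round(soft_target, 3))

In the full MT-BKD formulation described in the abstract, the teacher information also enters through a teacher-informed prior within Bayesian inference; that component is specific to the paper and is not reproduced in this sketch.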

Keywords

Uncertainty Quantification

Large Language Models

Bayesian Priors

Image Classification

Protein Subcellular Prediction 

Main Sponsor

Section on Statistical Computing