Multi-Teacher Bayesian Knowledge Distillation
Ping Ma
Co-Author
University of Georgia
Monday, Aug 4: 9:50 AM - 10:05 AM
1907
Contributed Papers
Music City Center
Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios that require diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), in which a distilled student model learns from multiple teachers within a Bayesian framework. Our approach leverages Bayesian inference to capture the uncertainty inherent in the distillation process. We construct a teacher-informed prior that integrates external knowledge from the teacher models and the training data, yielding better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances interpretability, improves predictive accuracy, and provides uncertainty quantification. Our experiments demonstrate improved performance and robust uncertainty quantification, highlighting the strengths of MT-BKD.
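To make the entropy-based weighting idea concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: it down-weights teachers whose soft predictions have high entropy and distills the student toward the resulting weighted mixture. The function name mt_bkd_distill_loss, the temperature value, and all variable names are assumptions introduced here for illustration only.

# Illustrative sketch of entropy-based multi-teacher weighting (hypothetical code).
import torch
import torch.nn.functional as F

def mt_bkd_distill_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Combine several teachers' soft targets, down-weighting high-entropy
    (less confident) teachers, then match the student to the weighted mixture."""
    # Soft targets from each teacher at the chosen temperature.
    teacher_probs = [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]

    # Predictive entropy per teacher, averaged over the batch (lower = more confident).
    entropies = torch.stack([
        -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1).mean() for p in teacher_probs
    ])

    # Entropy-based weights: more confident teachers receive more influence.
    weights = F.softmax(-entropies, dim=0)

    # Weighted mixture of teacher distributions serves as the distillation target.
    mixture = sum(w * p for w, p in zip(weights, teacher_probs))

    # Standard distillation objective: KL divergence from the mixture to the student.
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, mixture, reduction="batchmean") * temperature**2

# Example usage with random logits (batch of 8, 10 classes, 3 teachers):
student = torch.randn(8, 10)
teachers = [torch.randn(8, 10) for _ in range(3)]
loss = mt_bkd_distill_loss(student, teachers)

Note that this sketch covers only the weighting of teacher soft targets; the teacher-informed prior and the Bayesian treatment of uncertainty described in the abstract are not represented here.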
Uncertainty Quantification
Large Language Models
Bayesian Priors
Image Classification
Protein Subcellular Prediction
Main Sponsor
Section on Statistical Computing