曹炅宣* ** ***, 常明***, 张蕊** ***, 支天** ***, 张曦珊** ***. Automatic gradient blending method for knowledge distillation [J]. 高技术通讯 (Chinese edition), 2023, 33(12): 1276-1285
Automatic gradient blending for knowledge distillation
DOI: 10.3772/j.issn.1002-0470.2023.12.005
Keywords: deep neural network (DNN); knowledge distillation (KD); hyperparameter optimization (HPO); image classification
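曹炅宣* ** *** (*University of Science and Technology of China, Hefei 230026) (**Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (***Cambricon Technologies Co., Ltd., Beijing 100191)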
张蕊** ***  
支天** ***  
张曦珊** ***  
The loss function of knowledge distillation (KD) combines a task loss from the ground truth with a distillation loss from the teacher network, and how to efficiently find suitable weights for the two losses remains an open problem. To address it, this paper proposes an automatic gradient blending (AGB) method that automatically and efficiently finds suitable loss weights by searching for the optimal blending gradient of the two losses. We mainly consider the original design of knowledge distillation, in which the distillation loss is auxiliary to the task loss. AGB searches only for the gradient direction within the span of the gradient directions of the two losses, while constraining the norm of the blending gradient to equal the gradient norm of the task loss, which significantly reduces the search space. The weights of the two losses can then be computed automatically from the optimal blending gradient, avoiding time-consuming manual tuning. Extensive experiments on 10 knowledge distillation methods across 13 teacher-student combinations demonstrate the effectiveness and efficiency of AGB, which outperforms manual tuning on over 70% of the combinations while using fewer computational resources.
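
To make the last step of the abstract concrete, the sketch below shows one way loss weights could be recovered from a blending gradient that lies in the span of the two loss gradients and has the same norm as the task-loss gradient. It is an illustration under stated assumptions, not the paper's implementation: the function name `blend_weights`, the rotation angle `theta` used to parameterize the direction, and the toy gradient vectors are all hypothetical, and the abstract does not say how AGB parameterizes or searches the direction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))


def blend_weights(g_task, g_kd, theta):
    """Return (alpha, beta) with alpha*g_task + beta*g_kd equal to a vector
    that (a) lies in span{g_task, g_kd}, (b) has the norm of g_task, and
    (c) is rotated by the ASSUMED search variable theta from g_task
    toward g_kd.
    """
    n_t = norm(g_task)
    if n_t == 0.0:
        raise ValueError("task gradient must be nonzero")
    e1 = [x / n_t for x in g_task]              # unit vector along g_task
    proj = dot(g_kd, e1)                        # component of g_kd along e1
    r = [b - proj * a for a, b in zip(e1, g_kd)]  # part of g_kd orthogonal to e1
    n_r = norm(r)
    if n_r < 1e-12:
        # Gradients are parallel, so the span is one-dimensional:
        # fall back to the task gradient alone.
        return 1.0, 0.0
    # Target: n_t * (cos(theta) * e1 + sin(theta) * r / n_r).
    # Substituting r = g_kd - proj * e1 and n_t * e1 = g_task gives:
    alpha = math.cos(theta) - math.sin(theta) * proj / n_r
    beta = n_t * math.sin(theta) / n_r
    return alpha, beta


# Toy usage with made-up gradients (not data from the paper).
g_task = [3.0, 4.0]
g_kd = [0.0, 2.0]
alpha, beta = blend_weights(g_task, g_kd, math.pi / 6)
blend = [alpha * a + beta * b for a, b in zip(g_task, g_kd)]
print(alpha, beta, norm(blend))
```

Because the norm is fixed, the only free quantity in this sketch is the single angle `theta`, which is one way to read the abstract's claim that constraining the norm shrinks the search space.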
