Article Abstract
曹炅宣* ** ***, 常明***, 张蕊** ***, 支天** ***, 张曦珊** ***. Automatic gradient blending for knowledge distillation[J]. High Technology Letters (Chinese edition), 2023, 33(12): 1276-1285
Automatic gradient blending for knowledge distillation
  
DOI: 10.3772/j.issn.1002-0470.2023.12.005
Keywords: deep neural network (DNN); knowledge distillation (KD); hyperparameter optimization (HPO); image classification
Funding:
Author affiliations
曹炅宣* ** *** (*University of Science and Technology of China, Hefei 230026) (**Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (***Cambricon Technologies Corporation Limited, Beijing 100191)
常明***
张蕊** ***
支天** ***
张曦珊** ***
Chinese abstract:
      In knowledge distillation (KD), the student network is supervised simultaneously by the ground-truth data and by the teacher network, so its training loss contains a task loss from the ground-truth labels and a distillation loss from the teacher network. How to effectively set the weights of these two losses remains an unsolved problem. To overcome this problem, this paper proposes an automatic gradient blending (AGB) method, which automatically and efficiently finds suitable loss weights by searching for the optimal blending gradient of the two losses. In the original design of knowledge distillation, the distillation loss serves as an auxiliary to the task loss; accordingly, this paper constrains the norm of the blending gradient to equal the norm of the task-loss gradient and searches only the direction of the gradient vector, which significantly reduces the search space. Once the optimal blending gradient is found, the weights of the two losses can be computed automatically, avoiding the time-consuming manual tuning process. Extensive experiments are conducted on 13 different teacher-student network combinations and 10 different knowledge distillation methods. The results show that, while using fewer computational resources, the automatic gradient blending method outperforms manual tuning on 70% of the distillation methods.
English abstract:
      Since the loss function of knowledge distillation (KD) contains a task loss from the ground truth and a distillation loss from the teacher network, how to efficiently find suitable weights for the two losses remains an unsolved issue. To overcome this issue, this paper proposes an automatic gradient blending (AGB) method that automatically and efficiently finds suitable loss weights by searching for the optimal blending gradient of the two losses. Following the original design of knowledge distillation, in which the distillation loss is auxiliary to the task loss, AGB searches only the gradient direction within the span of the two losses' gradient directions, while constraining the norm of the blending gradient to equal the gradient norm of the task loss, which significantly reduces the search space. The loss weights of the two losses can then be computed automatically from the optimal blending gradient, avoiding the time-consuming manual tuning process. Extensive experiments on 10 different knowledge distillation methods and 13 different teacher-student combinations show the effectiveness and efficiency of AGB, which outperforms manual tuning methods on over 70% of the combinations with fewer computational resources.
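
The abstracts describe the method only at a high level. As a rough illustration of the stated constraint (search only the direction inside the span of the two gradients, keep the blended gradient's norm equal to the task-gradient norm, then recover the loss weights), the following is a minimal PyTorch sketch. The single mixing coefficient alpha, the grid search over it, the least-squares weight recovery, and the helper names blend_gradients and loss_weights_from_blend are illustrative assumptions, not the authors' published algorithm.

# Minimal sketch of the gradient-blending idea from the abstract (assumptions noted above).
import torch


def blend_gradients(g_task: torch.Tensor, g_kd: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend two flattened gradients inside the span of their directions.

    The blended direction is a convex mix of the two unit directions, and its
    norm is constrained to the norm of the task gradient, mirroring the
    constraint stated in the abstract.
    """
    d_task = g_task / (g_task.norm() + 1e-12)
    d_kd = g_kd / (g_kd.norm() + 1e-12)
    direction = (1.0 - alpha) * d_task + alpha * d_kd
    direction = direction / (direction.norm() + 1e-12)
    return direction * g_task.norm()


def loss_weights_from_blend(g_task, g_kd, g_blend):
    """Recover weights (w_task, w_kd) such that w_task*g_task + w_kd*g_kd ~= g_blend.

    A least-squares solve is used here as one plausible way to compute the
    loss weights from the optimal blending gradient.
    """
    A = torch.stack([g_task, g_kd], dim=1)                 # (n_params, 2)
    w = torch.linalg.lstsq(A, g_blend.unsqueeze(1)).solution
    return w.squeeze(1)                                    # (w_task, w_kd)


if __name__ == "__main__":
    torch.manual_seed(0)
    g_task = torch.randn(1000)  # stand-in for the flattened task-loss gradient
    g_kd = torch.randn(1000)    # stand-in for the flattened distillation-loss gradient

    # Search only the direction (here: a coarse grid over alpha) while the norm
    # stays fixed to ||g_task||; the selection criterion is not specified in the abstract.
    for alpha in torch.linspace(0.0, 1.0, 5):
        g_blend = blend_gradients(g_task, g_kd, float(alpha))
        w_task, w_kd = loss_weights_from_blend(g_task, g_kd, g_blend)
        print(f"alpha={float(alpha):.2f}  ||g_blend||={g_blend.norm():.3f}  "
              f"w_task={w_task:.3f}  w_kd={w_kd:.3f}")

In practice, each candidate blending gradient would be scored with whatever training or validation signal the paper uses to pick the optimum, which the abstract does not specify.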