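Yang Can (杨灿), Wang Chongxi (王重熙), Zhang Longbing (章隆兵). Accelerating memory intensive layer of neural networks with layer fusion [J]. Chinese High Technology Letters (高技术通讯, Chinese edition), 2023, 33(8): 823-835
Accelerating memory intensive layer of neural networks with layer fusion
Original title: 基于层间融合的神经网络访存密集型层加速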
DOI: 10.3772/j.issn.1002-0470.2023.08.005
Keywords: neural network; training; accelerator; convolutional neural network (CNN); memory intensive layer; batch normalization (BN) layer
Funding:
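Author | Affiliation
Yang Can (杨灿) | State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
Wang Chongxi (王重熙) |
Zhang Longbing (章隆兵) |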
Abstract:
In recent years, deep neural networks have been widely applied in many fields, and each application scenario requires training a neural network model to obtain better parameters, so the demand for training speed keeps growing. However, existing research usually focuses on accelerating computation intensive layers and ignores memory intensive layers. The efficiency of a memory intensive layer is determined mainly by memory bandwidth, so raising computation speed alone has little effect on its performance. From the perspective of execution order, this paper proposes fusing a memory intensive layer with the computation intensive layer before or after it into a new fused layer, in which the operation of the memory intensive layer is performed as pre-processing of the input data or post-processing of the original output data. This greatly reduces the off-chip memory accesses made by memory intensive layers during training and improves performance. Based on this fusion scheme, an accelerator for training is designed and implemented. It adopts two optimization strategies, temporarily storing the pre-processing results and executing post-processing concurrently with the operations of the computation intensive layer, to further improve the training performance of the fused layer. Experimental results show that, at the cost of a 6.4% increase in area and a 10.3% increase in power, the performance of the forward and backward propagation stages of training improves by 67.7% and 77.6%, respectively.
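The fusion idea in the abstract can be illustrated with a toy model of off-chip traffic. The sketch below is not the paper's accelerator or its BN handling. It assumes a tiny matrix-vector product as the computation intensive layer, an elementwise ReLU as a stand-in for the memory intensive layer, weights that are already on-chip, and a hypothetical `OffChipMemory` class that simply counts element reads and writes. All names are illustrative.

```python
class OffChipMemory:
    """Hypothetical off-chip memory that counts element reads and writes."""

    def __init__(self, preloaded):
        # Preloaded buffers model data already resident off-chip (not counted).
        self.buffers = {name: list(vals) for name, vals in preloaded.items()}
        self.reads = 0
        self.writes = 0

    def read(self, name):
        values = self.buffers[name]
        self.reads += len(values)
        return list(values)

    def write(self, name, values):
        self.buffers[name] = list(values)
        self.writes += len(values)

    def traffic(self):
        return self.reads + self.writes


def dense(inputs, weights):
    """Stand-in computation intensive layer: a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    """Stand-in memory intensive layer: a cheap elementwise operation."""
    return [max(0.0, v) for v in values]


def run_unfused(mem, weights):
    # Layer 1 writes its output off-chip; layer 2 reads it back.
    mem.write("y", dense(mem.read("x"), weights))
    mem.write("out", relu(mem.read("y")))
    return mem.buffers["out"]


def run_fused(mem, weights):
    # ReLU runs as post-processing on the on-chip result, so the
    # intermediate tensor never makes a round trip to off-chip memory.
    y_on_chip = dense(mem.read("x"), weights)
    mem.write("out", relu(y_on_chip))
    return mem.buffers["out"]


WEIGHTS = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # assumed to be held on-chip
X = [1.0, -2.0, 3.0]

unfused_mem = OffChipMemory({"x": X})
fused_mem = OffChipMemory({"x": X})
unfused_out = run_unfused(unfused_mem, WEIGHTS)
fused_out = run_fused(fused_mem, WEIGHTS)

print("unfused traffic:", unfused_mem.traffic())
print("fused traffic:", fused_mem.traffic())
```

Both paths produce the same output, while the fused path skips one write and one read of the intermediate tensor. In this toy setting that is the whole saving. A BN layer in training also needs batch statistics, which this sketch does not model.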