TAN Yijun*,**, DU Zidong*,***. Column-wise vector sparsity based neural network pruning method[J]. High Technology Letters (Chinese), 2025, 35(9): 960-968
Column-wise vector sparsity based neural network pruning method
DOI: 10.3772/j.issn.1002-0470.2025.09.005
Keywords: neural network; pruning; convolution; matrix multiplication
Authors and affiliations: TAN Yijun*,**, DU Zidong*,***
(* State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
(** University of Chinese Academy of Sciences, Beijing 100049)
(*** Shanghai Innovation Center for Processor Technologies, Shanghai 201815)
Chinese abstract:
Weight pruning is a highly effective way to reduce the size and computational cost of neural network models. However, non-zero weights are usually randomly distributed in sparse network models, which makes it very difficult for the sparse model to achieve practical computational speedup on general-purpose hardware, such as graphics processing units (GPUs), relative to the original dense implementation. Two kinds of acceleration solutions currently exist: the first modifies the hardware structure to support irregular memory accesses, and the second uses partially structured sparse patterns. Neither achieves practical, effective acceleration. To solve this problem, this paper proposes a sparse convolution based on a novel out-vector-wise (OVW) sparse format, developed through algorithm-software co-design. The OVW format treats each V×1 vector as a single pruning unit; by preserving the integrity of column vectors, it keeps memory accesses contiguous in the mapping from convolution to matrix multiplication. To reduce the accuracy loss caused by sparsity, this paper further proposes a computation-equivalent matrix transformation, row-channel permutation reordering, which gathers rows with similar weight-magnitude distributions together. Experimental results show that, on an NVIDIA V100 at 75% sparsity, the proposed method achieves a 3.2× speedup over state-of-the-art (SOTA) solutions and over the dense convolution of ResNet50, with only negligible accuracy loss. Moreover, whereas SOTA solutions achieve speedup only at 60% sparsity or above, the proposed method begins to obtain speedup at only 10% sparsity.
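As a concrete illustration of the V×1 pruning unit described above, the following is a minimal NumPy sketch of column-vector-wise pruning on an im2col-lowered weight matrix. The helper name ovw_prune, the choice V = 4, and the L2-norm ranking of vectors are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def ovw_prune(W, V=4, sparsity=0.75):
    """Prune an im2col-lowered weight matrix W (rows = output channels,
    columns = flattened kernel positions) in whole V x 1 column-vector
    units, keeping the highest-norm vectors.

    Illustrative sketch only: V, the L2-norm criterion, and the zero
    padding of the row count are assumptions, not the paper's method.
    """
    rows, cols = W.shape
    pad = (-rows) % V                          # pad rows to a multiple of V
    Wp = np.vstack([W, np.zeros((pad, cols))])
    groups = Wp.reshape(-1, V, cols)           # (num_groups, V, cols)
    norms = np.linalg.norm(groups, axis=1)     # one score per V x 1 vector
    k = int(norms.size * sparsity)             # number of vectors to drop
    thresh = np.partition(norms.ravel(), k)[k] if k > 0 else -np.inf
    mask = (norms >= thresh)[:, None, :]       # broadcast over the V axis
    return (groups * mask).reshape(-1, cols)[:rows]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 3 * 3 * 64))      # 64 output channels, 3x3x64 kernels
Wp = ovw_prune(W, V=4, sparsity=0.75)
print(1 - np.count_nonzero(Wp) / W.size)       # ~0.75, zeros in whole 4x1 units
```

Because entire V×1 runs of consecutive output channels are kept or dropped together, a GEMM kernel reading the surviving vectors still touches contiguous memory, which is the property the OVW format is designed to preserve.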
English abstract:
Weight pruning is an extremely effective method for reducing the size and computational cost of neural network models. However, non-zero weights are often randomly distributed in sparse network models, making it difficult to achieve practical computational acceleration on general-purpose hardware (such as GPUs) relative to the original dense implementation. Existing acceleration solutions either modify the hardware to support irregular memory accesses or rely on partially structured sparse patterns; neither achieves practical and effective acceleration. To address this issue, this paper proposes an algorithm-software co-designed sparse convolution based on a novel out-vector-wise (OVW) sparse format. The OVW format treats each V×1 vector as a pruning unit; by preserving the integrity of column vectors, it keeps memory accesses contiguous in the mapping from convolution to matrix multiplication. To reduce the loss of network accuracy caused by sparsity, this paper proposes a computation-equivalent matrix transformation, row-channel permutation reordering, which groups rows with similar weight-magnitude distributions together. Experimental results demonstrate that, on the NVIDIA V100 at 75% sparsity, the proposed method achieves a 3.2× speedup over state-of-the-art (SOTA) solutions and over the dense convolution of ResNet50, with only negligible accuracy loss. Furthermore, whereas SOTA solutions achieve acceleration only at 60% sparsity or higher, the proposed method begins to obtain speedup at only 10% sparsity.
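The claim that row-channel permutation reordering is computation-equivalent can be checked directly: permuting the rows of the lowered weight matrix only permutes the output channels, and applying the inverse permutation to the output recovers the original result exactly. The sketch below demonstrates this with a simple norm-based sort as a hypothetical stand-in for the paper's similarity grouping; in a full network, the inverse permutation would presumably be folded into adjacent layers rather than applied at runtime.

```python
import numpy as np

def sort_rows_by_norm(W):
    """Hypothetical stand-in for the paper's row-channel permutation:
    order output-channel rows by L1 norm so rows with similar magnitude
    distributions land in the same V x 1 pruning group."""
    return np.argsort(np.linalg.norm(W, ord=1, axis=1))

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 576))       # lowered conv weights
X = rng.standard_normal((576, 32))       # im2col-lowered activations

perm = sort_rows_by_norm(W)
inv = np.argsort(perm)                   # inverse permutation

Y_ref = W @ X                            # original convolution-as-GEMM
Y_perm = W[perm] @ X                     # permuted rows -> permuted output rows
assert np.allclose(Y_perm[inv], Y_ref)   # un-permuting restores the result
```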