Article abstract
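Shui Chaoyang (水超洋), Tan Guangming (谭光明). Analytical model-driven matrix multiplication optimization on the Hygon deep compute unit [J]. Chinese High Technology Letters (高技术通讯), 2025, 35(12): 1263-1276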
Analytical model-driven matrix multiplication optimization on the Hygon deep compute unit
DOI: 10.3772/j.issn.1002-0470.2025.12.001
Keywords: matrix multiplication optimization; analytical model; Hygon deep compute unit
Funding:
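Author  Affiliation
Shui Chaoyang (水超洋)  (*State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Tan Guangming (谭光明)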
Abstract:
      This paper presents an analytical-model-based method for optimizing dense matrix multiplication on the domestic Hygon deep compute unit (DCU). High-performance algorithm implementations require software optimizations to be mapped precisely onto hardware characteristics. On a variety of central processing unit (CPU) architectures, analytical models have proven effective: they determine software parameters from architecture parameters and achieve performance comparable to expert-tuned implementations. The Hygon DCU accelerator is one of the successful representatives of domestic high-performance chips and is important for the independence and controllability of domestic chips. However, algorithm optimization on the DCU lacks methodological guidance and faces problems such as difficulty in determining key algorithm parameters, low performance, and excessive reliance on experience. Taking matrix multiplication as a case study, this paper proposes an analytical model of matrix multiplication for the Hygon DCU architecture. First, the general architectural features of the Hygon DCU and the matrix multiplication algorithm are modeled from the hardware and algorithm perspectives, respectively. On this basis, bandwidth analysis, latency analysis, and resource analysis are used to connect the choice of algorithm parameters to the underlying hardware architecture, so that the key algorithm parameters for different types of matrix multiplication on DCUs of different architectures can be determined quickly. Experimental results show that the parameters derived from the analytical model are consistent with those selected by experts, and that the model-driven implementation achieves performance comparable to expert implementations. This work not only provides a reference for optimizing other dense computations on the Hygon DCU, but also offers a feasible way to turn implicit optimization experience into an explicit method.