Optimization and performance analysis of the linear scaling three-dimensional fragment method on ARM architecture

Yan Yujin* **; Tan Guangming* **; Jia Weile* **

Article abstract

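Yan Yujin* **, Tan Guangming* **, Jia Weile* **. Optimization and performance analysis of the linear scaling three-dimensional fragment method on ARM architecture[J]. Chinese High Technology Letters (高技术通讯), 2025, 35(12): 1277-1290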

Optimization and performance analysis of the linear scaling three-dimensional fragment method on ARM architecture

DOI: 10.3772/j.issn.1002-0470.2025.12.002

Keywords: high-performance computing; electronic structure; first-principles calculations; linear scaling three-dimensional fragment method; Fugaku supercomputer; performance model

Funding:

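Author	Affiliation
Yan Yujin* **	(High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (*University of Chinese Academy of Sciences, Beijing 100049)
Tan Guangming* **
Jia Weile* **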

Abstract views: 37

Full-text downloads: 32

Abstract:

As semiconductor devices shrink to the nanoscale, quantum effects have an increasingly significant impact on device simulation, which forces computer-aided design to incorporate electronic structure calculations. Large-scale electronic structure calculations face a dual challenge of computational complexity and communication load. To address it, this paper optimizes the linear scaling three-dimensional fragment (LS3DF) method at both the algorithmic and system levels on the Fugaku supercomputer, significantly improving computational efficiency and scalability. The algorithmic optimizations are a mixed-precision strategy and angular optimization of the all-band conjugate gradient algorithm. At the system level, the paper proposes a coarse-grained parallel strategy, a band blocking strategy, and a three-dimensional fast Fourier transform (FFT) strategy. Together, these optimizations improve computational efficiency by a factor of 4.61, and in large-scale tests the efficiency reaches 93.69% on 2,560 nodes. In addition, a performance model is abstracted from the study; its estimated running times differ from the measured running times by less than 5.00%.
