Article Abstract
ZHANG Xiaoyang*,**, XIAO Junmin*, YAO Jiashu*,**, TAN Guangming*. MACO: memory-based automatic code optimization of CNNs [J]. High Technology Letters (Chinese edition), 2023, 33(12): 1253-1264
MACO: memory-based automatic code optimization of CNNs
  
DOI: 10.3772/j.issn.1002-0470.2023.12.003
Keywords: memory optimization; artificial intelligence (AI); inference; data layout; auto-tuning
Authors and affiliations:
ZHANG Xiaoyang*,**, XIAO Junmin*, YAO Jiashu*,**, TAN Guangming*
(*High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
(**University of Chinese Academy of Sciences, Beijing 100049)
Abstract (translated from Chinese):
      Automatic optimization of inference has long been a research focus at the intersection of artificial intelligence (AI) and computer architecture, yet few automatic optimization schemes take memory access as their starting point. From both a global and a local perspective, this paper takes a memory-access view of the problem of excessive optimization time in automatic code optimization of convolutional neural networks (CNNs), targeting the automatic optimization of data layouts and kernels. To analyze memory access effectively, this paper improves the modeling method of the classical red-blue pebble game, proposes a new I/O lower-bound estimation method that reduces the difficulty of lower-bound estimation for multi-stage composite algorithms, and estimates the I/O lower bound of convolution based on the improved model. Guided by the convolution lower-bound results, this paper designs the data flow accordingly and prunes the huge search space produced by automatic template generation, avoiding a large number of invalid searches. As a result, kernel search is significantly faster than in the unpruned search space, and the resulting kernels achieve an average 2.24× speedup over cuDNN under general convolution parameters, ensuring kernel performance. In addition, this paper uses neural networks to predict convolution performance under different data layouts, achieving a higher R² score than traditional machine-learning models; a hybrid layout strategy combining a data-layout backtracking algorithm with the prediction model yields 1.28×, 1.32×, and 1.29× speedups over the default layout strategy on ResNet-18, AlexNet, and VGG-11, respectively.
English abstract:
      Inference automatic optimization has been a focus of research at the intersection of artificial intelligence (AI) and system architecture. However, few optimization schemes take memory access as their starting point. In this paper, the high time cost of automatic optimization of convolutional neural network (CNN) data layouts and kernels is studied from the perspective of memory access, at both the global and local levels. To perform the access analysis efficiently, the classical red-blue pebble game is re-explored, and a new method is proposed to estimate the I/O lower bound, which reduces the difficulty of lower-bound estimation for multi-stage composite algorithms. This work analyzes the convolutional I/O lower bound based on the improved model and redesigns the data flow according to the estimated results. It purposefully prunes the huge search space under the auto-template generation technique to avoid a large number of invalid searches, so that kernel search is significantly accelerated compared with the unoptimized search space, and performance is improved by an average of 2.24× over cuDNN under general convolutional parameters, ensuring kernel performance. This work also implements convolutional performance prediction under different data layouts with the help of neural networks, achieving an R² score higher than that of traditional machine-learning models. A hybrid layout strategy based on a data-layout backtracking algorithm and the prediction model achieves 1.28×, 1.32×, and 1.29× improvements over the default layout strategy on ResNet-18, AlexNet, and VGG-11, respectively.
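The abstract describes choosing a data flow so that slow-memory traffic approaches the estimated I/O lower bound under a limited fast memory. As a rough illustration of that idea (not the paper's actual cost model), the following Python sketch enumerates output-tile shapes for a direct convolution, keeps only tilings whose working set fits a hypothetical fast memory of `M` words, and picks the one with the least modeled I/O. All names (`modeled_io`, `best_tiling`), the candidate tile sizes, and the per-tile cost model are assumptions made for illustration.

```python
import math

def modeled_io(C, H, W, K, R, S, th, tw, tk):
    """Modeled slow-memory traffic (in words) for a direct convolution
    with C input channels, K output channels, HxW output, RxS filters,
    tiled into th x tw spatial output tiles over tk output channels.
    Per tile: load the input patch (with halo), load the tile's filters,
    and store the tile's outputs."""
    tiles = math.ceil(H / th) * math.ceil(W / tw) * math.ceil(K / tk)
    in_patch = C * (th + R - 1) * (tw + S - 1)   # input halo included
    weights = tk * C * R * S
    outputs = tk * th * tw
    return tiles * (in_patch + weights + outputs)

def working_set(C, R, S, th, tw, tk):
    """Words of fast memory one tile needs resident at once."""
    return C * (th + R - 1) * (tw + S - 1) + tk * C * R * S + tk * th * tw

def best_tiling(C, H, W, K, R, S, M):
    """Search a small grid of candidate tilings, keep those fitting in
    M words of fast memory, and return (io, (th, tw, tk)) minimizing
    the modeled I/O."""
    best = None
    for th in (1, 2, 4, 8, 16):
        for tw in (1, 2, 4, 8, 16):
            for tk in (1, 2, 4, 8, 16, 32):
                if working_set(C, R, S, th, tw, tk) > M:
                    continue  # tile would not fit in fast memory
                io = modeled_io(C, H, W, K, R, S, th, tw, tk)
                if best is None or io < best[0]:
                    best = (io, (th, tw, tk))
    return best
```

For example, `best_tiling(64, 56, 56, 64, 3, 3, 32 * 1024)` selects a tiling whose modeled traffic is far below the untitled (1×1×1) case, because larger tiles amortize filter and input reloads; pruning the candidate grid by the capacity constraint before costing mirrors, in miniature, how a lower-bound-guided model can discard invalid points from an auto-tuning search space.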