LI Nan(李 楠)* ** ***, ZHAO Yongwei**, ZHI Tian**, LIU Chang** ***, DU Zidong**, HU Xing**, LI Wei**, ZHANG Xishan** ***, LI Ling****, SUN Guangzhong*. [J]. High Technology Letters (English edition), 2024, 30(1): 52-60 |
|
Cambricon-QR: a sparse and bitwise reproducible quantized training accelerator |
|
DOI: 10.3772/j.issn.1006-6748.2024.01.006 |
Chinese keywords: |
English keywords: quantized training, sparse accelerator, Cambricon-QR |
Funding: |
Author Name | Affiliation |
LI Nan(李 楠)* ** *** | (* School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, P. R. China); (** State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100086, P. R. China); (*** Cambricon Tech. Ltd, Beijing 100191, P. R. China); (**** Institute of Software, Chinese Academy of Sciences, Beijing 100086, P. R. China) |
ZHAO Yongwei** | |
ZHI Tian** | |
LIU Chang** *** | |
DU Zidong** | |
HU Xing** | |
LI Wei** | |
ZHANG Xishan** *** | |
LI Ling**** | |
SUN Guangzhong* | |
|
Abstract: |
Quantized training has proven to be a prominent method for training deep neural networks under limited computational resources. It uses low bit-width arithmetic with a proper scaling factor to achieve negligible accuracy loss. Cambricon-Q is an ASIC design proposed to support quantized training efficiently, and it achieves significant performance improvement. However, the design still has two caveats. First, Cambricon-Q with different hardware specifications may produce different numerical errors, resulting in non-reproducible behavior, which may become a major concern in critical applications. Second, Cambricon-Q cannot leverage data sparsity, which leaves considerable cycles that could still be squeezed out. To address these caveats, the acceleration core of Cambricon-Q is redesigned to support fine-grained irregular data processing. The new design not only enables acceleration on sparse data, but also performs local dynamic quantization by contiguous value ranges (which is hardware independent) instead of contiguous addresses (which depends on hardware factors). Experimental results show that the accuracy loss of the method remains negligible, and the accelerator achieves a 1.61× performance improvement over Cambricon-Q, with about 10% energy increase. |
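The contrast between quantizing by contiguous addresses and by contiguous value ranges can be illustrated with a small sketch. This is not the Cambricon-QR implementation: the 8-bit symmetric format, the `block_size` parameter (standing in for a hardware-dependent buffer width), the power-of-two range boundaries derived from the global maximum, and all function names are assumptions made only for illustration.

```python
QMAX = 127  # assumed symmetric 8-bit integer range


def quantize(values, scale):
    """Quantize values with one scale factor, then dequantize them."""
    if scale == 0.0:
        return [0.0 for _ in values]
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def quantize_by_address(data, block_size):
    """Local quantization over contiguous address blocks.

    Each block gets its own scale, so the result depends on block_size,
    which here stands in for a hardware parameter.
    """
    out = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        scale = max(abs(v) for v in block) / QMAX
        out.extend(quantize(block, scale))
    return out


def value_range_bounds(data, num_ranges=4):
    """Upper bounds of the value ranges, derived from the global maximum only."""
    top = max(abs(v) for v in data)
    return sorted(top / (2 ** k) for k in range(num_ranges))


def quantize_by_value_range(data, num_ranges=4):
    """Local quantization over contiguous value ranges.

    Each element's scale is chosen from its own magnitude, so the result
    does not depend on where the element is stored or how data is blocked.
    """
    bounds = value_range_bounds(data, num_ranges)
    out = []
    for v in data:
        # Smallest range whose upper bound covers |v|; the last bound is the
        # global maximum, so a match always exists.
        upper = next(b for b in bounds if abs(v) <= b)
        out.extend(quantize([v], upper / QMAX))
    return out


data = [0.01, 0.02, 5.0, 0.03, 0.04, 0.05, 0.06, 0.07]
by_addr_2 = quantize_by_address(data, 2)
by_addr_4 = quantize_by_address(data, 4)
by_value = quantize_by_value_range(data)
by_value_reversed = quantize_by_value_range(list(reversed(data)))
```

In this sketch the address-based results differ between the two block sizes, because the outlier `5.0` shares a scale with different neighbours in each layout. The value-range result for each element is unchanged when the data is reordered.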