Article Abstract
YUAN Kaizhao (袁凯钊)* ** ***, ZHANG Rui*, PAN Yansong* ** ***, YI Qi* *** ****, PENG Shaohui*****, GUO Jiaming*, HE Wenkai* ** ***, HU Xing* ******. [J]. 高技术通讯(英文), 2025, 31(2): 118-130
StM: a benchmark for evaluating generalization in reinforcement learning
  
DOI: 10.3772/j.issn.1006-6748.2025.02.002
Keywords: reinforcement learning (RL), generalization, benchmark, environment
Authors and Affiliations:
YUAN Kaizhao (袁凯钊)* ** ***, ZHANG Rui*, PAN Yansong* ** ***, YI Qi* *** ****, PENG Shaohui*****, GUO Jiaming*, HE Wenkai* ** ***, HU Xing* ******

*State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, P. R. China
**University of Chinese Academy of Sciences, Beijing 101408, P. R. China
***Cambricon Technologies, Beijing 100089, P. R. China
****University of Science and Technology of China, Hefei 230026, P. R. China
*****Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190, P. R. China
******Shanghai Innovation Center for Processor Technologies, Shanghai 201210, P. R. China
Abstract:
      The challenge of enhancing the generalization capacity of reinforcement learning (RL) agents remains a formidable obstacle. Existing RL methods, despite achieving superhuman performance on certain benchmarks, often struggle in this respect. A potential reason is that the benchmarks used for training and evaluation do not offer a sufficiently diverse set of transferable tasks. Although recent studies have developed benchmarking environments to address this shortcoming, they typically fall short of providing tasks that both ensure a solid foundation for generalization and exhibit significant variability. To overcome these limitations, this work introduces the principle that 'objects are composed of more fundamental components' into environment design, as implemented in the proposed environment, summon the magic (StM). This environment generates tasks whose objects are derived from extensible and shareable basic components, facilitating strategy reuse and enhancing generalization. Furthermore, two new metrics, adaptation sensitivity range (ASR) and parameter correlation coefficient (PCC), are proposed to better capture and evaluate the generalization process of RL agents. Experimental results show that increasing the number of basic components per object reduces the proximal policy optimization (PPO) agent's training-testing gap by 60.9% (in episode reward), significantly alleviating overfitting. Additionally, linear variations in other environmental factors, such as the proportion of the training monster set and the total number of basic components, uniformly decrease the gap by at least 32.1%. These results highlight StM's effectiveness in benchmarking and probing the generalization capabilities of RL algorithms.
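The abstract quantifies generalization through the training-testing gap in episode reward and its relative reduction. The paper's exact formulation is not reproduced here; the following is a minimal sketch, assuming the gap is simply the difference between mean episode rewards on the training task set and on held-out testing tasks. The function names and the numeric values are illustrative assumptions, not the paper's API or results.

```python
import numpy as np

def train_test_gap(train_episode_rewards, test_episode_rewards):
    """Assumed definition: mean episode reward on training tasks minus
    mean episode reward on held-out testing tasks. A smaller gap
    indicates less overfitting and better generalization."""
    return np.mean(train_episode_rewards) - np.mean(test_episode_rewards)

def relative_gap_reduction(gap_baseline, gap_variant):
    """Relative reduction of the gap when moving from a baseline
    configuration (e.g. few basic components per object) to a variant
    (e.g. many components), as in the reported 60.9% figure."""
    return 1.0 - gap_variant / gap_baseline

# Illustrative numbers only (not taken from the paper):
train_r = [12.0, 11.5, 12.3]   # PPO episode rewards on training tasks
test_r  = [7.8, 8.1, 7.5]      # PPO episode rewards on testing tasks
print(train_test_gap(train_r, test_r))     # ~4.1
print(relative_gap_reduction(4.1, 1.6))    # ~0.61, i.e. a ~61% reduction
```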