BAN Jian (班 健)* **, LI Gongyan*, XU Shaoyun*. SPaRM: an efficient exploration and planning framework for sparse reward reinforcement learning[J]. High Technology Letters, 2024, 30(4): 344-355
|
SPaRM: an efficient exploration and planning framework for sparse reward reinforcement learning |
|
DOI: 10.3772/j.issn.1006-6748.2024.04.002
Keywords: reinforcement learning (RL), sparse reward, reward-free exploration (RFE), space partitioning (SP), reverse merging (RM)
Authors: BAN Jian (班 健)* **, LI Gongyan*, XU Shaoyun*
* Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, P. R. China
** University of Chinese Academy of Sciences, Beijing 100049, P. R. China
|
|
Abstract:
Because of the long-horizon problem, a reinforcement learning (RL) agent must visit a substantial number of states during the exploration phase to gather valuable information. In addition, because rewards are sparse, the planning phase of RL spends considerable time on repetitive, unproductive work before it gains adequate access to the sparse reward signals. To address these challenges, this work proposes a space partitioning and reverse merging (SPaRM) framework based on reward-free exploration (RFE). The framework consists of two modules: a space partitioning module and a reverse merging module. The former partitions the entire state space into a specific number of subspaces to expedite the exploration phase, and this work establishes a theoretical lower bound on its sample complexity. The latter plans in reverse, starting near the target and gradually extending back to the initial state, rather than following the conventional practice of planning forward from the start; this allows the sparse reward at the target to take part in policy updates early. This work designs two experimental environments: a complex maze and a set of randomly generated maps. Compared with two state-of-the-art (SOTA) algorithms, experimental results validate the effectiveness and superior performance of the proposed algorithm.
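The abstract describes the two modules only at a high level; the full algorithm is in the paper itself. The sketch below is one possible, heavily simplified reading of that description, not the authors' implementation: the chain environment, the predecessor-map representation, and the names partition_states and reverse_merge_plan are all assumptions introduced here for illustration.

```python
# Minimal sketch of the SPaRM idea as described in the abstract.
# All names and the toy environment are assumptions, not the paper's code.
import numpy as np

def partition_states(states, n_parts):
    """Space partitioning (assumed form): split the state set into
    n_parts subspaces so each can be explored reward-free on its own."""
    states = np.asarray(states)
    order = np.argsort(states[:, 0])      # crude 1-D ordering heuristic
    return np.array_split(states[order], n_parts)

def reverse_merge_plan(pred, start, goal):
    """Reverse merging (assumed form): grow the plan backwards from the
    goal toward the start, so the sparse goal signal is involved from
    the very first planning step. `pred` maps state -> predecessors."""
    parent = {goal: None}                 # parent points one step toward goal
    frontier = [goal]
    while frontier:
        s = frontier.pop(0)               # breadth-first, backwards
        if s == start:                    # reached the start: read off path
            path, node = [], start
            while node is not None:
                path.append(node)
                node = parent[node]
            return path                   # start -> ... -> goal
        for p in pred.get(s, []):
            if p not in parent:
                parent[p] = s
                frontier.append(p)
    return None                           # goal unreachable from start

# Toy chain 0-1-2-3-4: each state's predecessors are its neighbors.
pred = {s: [t for t in (s - 1, s + 1) if 0 <= t <= 4] for s in range(5)}
print(reverse_merge_plan(pred, start=0, goal=4))   # [0, 1, 2, 3, 4]
print(partition_states([[s] for s in range(5)], n_parts=2))
```

In this toy setting the backward search reaches the goal's neighborhood immediately, which mirrors the abstract's point that reverse planning exposes the sparse target reward early instead of only after long forward rollouts.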
|
|
|