Timezone: »

CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
Shuang Ao · Tianyi Zhou · Guodong Long · Qinghua Lu · Liming Zhu · Jing Jiang

Wed Dec 08 04:30 PM -- 06:00 PM (PST) @ Virtual

Goal-conditioned reinforcement learning (RL) usually suffers from sparse reward and inefficient exploration in long-horizon tasks. Planning can find the shortest path to a distant goal that provides dense reward/guidance but is inaccurate without a precise environment model. We show that RL and planning can collaboratively learn from each other to overcome their own drawbacks. In ''CO-PILOT'', a learnable path-planner and an RL agent produce dense feedback to train each other on a curriculum of tree-structured sub-tasks. Firstly, the planner recursively decomposes a long-horizon task to a tree of sub-tasks in a top-down manner, whose layers construct coarse-to-fine sub-task sequences as plans to complete the original task. The planning policy is trained to minimize the RL agent's cost of completing the sequence in each layer from top to bottom layers, which gradually increases the sub-tasks and thus forms an easy-to-hard curriculum for the planner. Next, a bottom-up traversal of the tree trains the RL agent from easier sub-tasks with denser rewards on bottom layers to harder ones on top layers and collects its cost on each sub-task train the planner in the next episode. CO-PILOT repeats this mutual training for multiple episodes before switching to a new task, so the RL agent and planner are fully optimized to facilitate each other's training. We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency.

Author Information

Shuang Ao (University of Technology Sydney)
Tianyi Zhou (University of Washington, Seattle)
Tianyi Zhou

Tianyi Zhou (https://tianyizhou.github.io) is a tenure-track assistant professor of computer science at the University of Maryland, College Park. He received his Ph.D. from the school of computer science & engineering at the University of Washington, Seattle. His research interests are in machine learning, optimization, and natural language processing (NLP). His recent works study curriculum learning that can combine high-level human learning strategies with model training dynamics to create a hybrid intelligence. The applications include semi/self-supervised learning, robust learning, reinforcement learning, meta-learning, ensemble learning, etc. He published >80 papers and is a recipient of the Best Student Paper Award at ICDM 2013 and the 2020 IEEE Computer Society TCSC Most Influential Paper Award.

Guodong Long (University of Technology Sydney (UTS))
Qinghua Lu (Data61, CSIRO)
Liming Zhu (CSIRO)
Jing Jiang (University of Technology Sydney)

More from the Same Authors