Poster
First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs
Ben Norman · Jeff Clune
West Ballroom A-D #6407
Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e., by taking into account complex domain priors and adapting quickly to previous explorations). Across episodes, RL agents struggle to perform even simple exploration strategies, such as systematic search that avoids exploring the same location multiple times. This lack of exploration limits performance on challenging domains. Meta-RL is a potential solution because, unlike standard RL, meta-RL can learn to explore. This capacity potentially enables learning complex exploration strategies far beyond those of standard RL, such as experimenting in early episodes to learn new skills or conducting experiments to learn about the current environment. Traditional meta-RL focuses on the problem of learning to optimally balance exploration and exploitation so as to maximize the cumulative reward of the episode sequence. We identify a new challenge with state-of-the-art cumulative-reward meta-RL methods. We show that when optimal behaviour requires forgoing reward in early episodes to explore, existing cumulative-reward meta-RL methods fail to learn good exploration and instead become trapped in a local optimum of not exploring. We introduce a new method, First-Explore, which overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring and thus forgoing early-episode reward is required, First-Explore significantly outperforms existing cumulative-reward meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of more human-like exploration on a broader range of domains.
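The trade-off described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: both policies are hand-coded rather than meta-learned, and the bandit setup, the `ABSTAIN` action, the reward values, and all function names are assumptions made for illustration. One arm pays +1 and the others pay -0.2, so trying an unknown arm has negative expected reward, and a baseline that only acts when the expected immediate reward is positive never explores.

```python
ABSTAIN = None  # hypothetical "do nothing" action that always yields reward 0


def pull(arm, arm_rewards):
    """Return the deterministic reward of an action in this toy bandit."""
    return 0.0 if arm is ABSTAIN else arm_rewards[arm]


def explore_policy(history, n_arms):
    """Solely explore: systematic search that never retries an arm."""
    tried = {arm for arm, _ in history}
    for arm in range(n_arms):
        if arm not in tried:
            return arm
    return ABSTAIN  # every arm has been tried


def exploit_policy(history):
    """Solely exploit: play the best arm seen so far, or abstain if none paid off."""
    best_arm, best_reward = ABSTAIN, 0.0
    for arm, reward in history:
        if reward > best_reward:
            best_arm, best_reward = arm, reward
    return best_arm


def explore_then_exploit(arm_rewards, n_explore, n_episodes):
    """Explore for the first n_explore episodes, then exploit what was found."""
    history, total = [], 0.0
    for episode in range(n_episodes):
        exploring = episode < n_explore
        if exploring:
            arm = explore_policy(history, len(arm_rewards))
        else:
            arm = exploit_policy(history)
        reward = pull(arm, arm_rewards)
        if exploring and arm is not ABSTAIN:
            history.append((arm, reward))
        total += reward
    return total


def myopic(arm_rewards, n_episodes):
    """Baseline: tries an unknown arm only if the expected immediate reward is positive."""
    # Assumption: the mean over arms stands in for the agent's prior knowledge.
    prior_mean = sum(arm_rewards) / len(arm_rewards)
    history, total = [], 0.0
    for _ in range(n_episodes):
        arm = exploit_policy(history)
        if arm is ABSTAIN and prior_mean > 0:
            arm = explore_policy(history, len(arm_rewards))
        reward = pull(arm, arm_rewards)
        if arm is not ABSTAIN:
            history.append((arm, reward))
        total += reward
    return total


# Illustrative values only: nine arms cost 0.2 per pull, one arm pays 1.0.
ARM_REWARDS = [-0.2] * 10
ARM_REWARDS[7] = 1.0

print("myopic:", myopic(ARM_REWARDS, n_episodes=30))
print("explore then exploit:", explore_then_exploit(ARM_REWARDS, n_explore=10, n_episodes=30))
```

With these illustrative numbers, the ten exploration episodes have a negative total reward (about -0.8), yet the full 30-episode return is about 19.2, whereas the myopic baseline stays at 0. This mirrors, in a hand-built setting, the local optimum of not exploring that the abstract describes.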