Poster in Workshop: ML x OR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making

A Deep Proactive Exploration Policy Based on Asymptotic Statistics for Asynchronous Q-Learning

Xinbo Shi ⋅ Jinyang Jiang ⋅ Ruihan Zhou ⋅ Yijie Peng ⋅ Jing Dong


Abstract

This paper presents a new methodology that adaptively optimizes the exploration policy in reinforcement learning. We shift the objective from value-function accuracy to direct policy identification, using the probability of correct selection as the metric. We establish a new central limit theorem for asynchronous Q-learning with adaptive step sizes and propose a regularized signal-to-noise ratio index for exploration policy design. To address the computational cost of the high-dimensional optimization, we propose a novel pipeline built around an offline, simulation-based proactive learning loop. This loop trains a deep neural network to serve as a fast, low-dimensional proxy for the complex optimization problem, enabling efficient online policy updates in real-world applications. We validate our approach on the challenging RiverSwim environment, where it outperforms standard exploration heuristics.
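For orientation, the setting described in the abstract can be illustrated with a minimal sketch of tabular asynchronous Q-learning on a RiverSwim-style chain, together with a Monte Carlo estimate of the probability of correct selection. This is not the authors' method: it uses a uniform behavior policy as a stand-in for the kind of heuristic the paper seeks to improve on, and it contains no signal-to-noise index or neural proxy. The transition probabilities, rewards, discount factor, step-size schedule, and all function names below are assumptions chosen for illustration.

```python
import random

N_STATES, LEFT, RIGHT = 6, 0, 1
GAMMA = 0.95  # assumed discount factor


def transitions(state, action):
    """(probability, next_state, reward) triples for an assumed RiverSwim variant."""
    last = N_STATES - 1
    if action == LEFT:
        # Swimming with the current always succeeds; small reward at the left bank.
        return [(1.0, max(state - 1, 0), 0.005 if state == 0 else 0.0)]
    # Swimming against the current: advance, stay put, or slip back.
    reward = 1.0 if state == last else 0.0
    return [(0.35, min(state + 1, last), reward),
            (0.60, state, reward),
            (0.05, max(state - 1, 0), reward)]


def step(state, action, rng):
    """Sample one transition from the assumed model."""
    u, acc = rng.random(), 0.0
    for prob, nxt, reward in transitions(state, action):
        acc += prob
        if u < acc:
            return nxt, reward
    return nxt, reward  # guard against floating-point round-off


def async_q_learning(n_steps, seed=0, alpha_exp=0.7):
    """Single-trajectory Q-learning; only the visited (state, action) pair is updated."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    visits = [[0, 0] for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):
        action = rng.choice((LEFT, RIGHT))  # uniform exploration (baseline assumption)
        nxt, reward = step(state, action, rng)
        visits[state][action] += 1
        # Step size depends on the visit count of this pair (assumed polynomial decay).
        lr = visits[state][action] ** (-alpha_exp)
        q[state][action] += lr * (reward + GAMMA * max(q[nxt]) - q[state][action])
        state = nxt
    return q, visits


def greedy_policy(q):
    """Greedy action per state; ties resolve to LEFT."""
    return [max((LEFT, RIGHT), key=lambda a: row[a]) for row in q]


def optimal_policy(n_iter=2000):
    """Reference policy from value iteration on the known (assumed) model."""
    v = [0.0] * N_STATES

    def q_value(s, a):
        return sum(p * (r + GAMMA * v[n]) for p, n, r in transitions(s, a))

    for _ in range(n_iter):
        v = [max(q_value(s, a) for a in (LEFT, RIGHT)) for s in range(N_STATES)]
    return [max((LEFT, RIGHT), key=lambda a: q_value(s, a)) for s in range(N_STATES)]


def estimate_pcs(n_runs, n_steps):
    """Fraction of independent runs whose greedy policy matches the reference policy."""
    target = optimal_policy()
    hits = sum(greedy_policy(async_q_learning(n_steps, seed=i)[0]) == target
               for i in range(n_runs))
    return hits / n_runs
```

Calling `estimate_pcs(n_runs, n_steps)` for increasing `n_steps` gives a rough picture of how quickly a fixed exploration policy identifies the reference policy under these assumed dynamics. Judging by value-function error instead would require comparing `q` against the value-iteration solution.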
