Skip to yearly menu bar Skip to main content


Counterexample Guided RL Policy Refinement Using Bayesian Optimization

Briti Gangopadhyay · Pallab Dasgupta


Keywords: [ Optimization ] [ Reinforcement Learning and Planning ]


Constructing Reinforcement Learning (RL) policies that adhere to safety requirements is an emerging field of study. RL agents learn via trial and error with an objective to optimize a reward signal. Often policies that are designed to accumulate rewards do not satisfy safety specifications. We present a methodology for counterexample guided refinement of a trained RL policy against a given safety specification. Our approach has two main components. The first component is an approach to discover failure trajectories using Bayesian optimization over multiple parameters of uncertainty from a policy learnt in a model-free setting. The second component selectively modifies the failure points of the policy using gradient-based updates. The approach has been tested on several RL environments, and we demonstrate that the policy can be made to respect the safety specifications through such targeted changes.

Chat is not available.