Workshop: Workshop on Machine Learning Safety

A general framework for reward function distances

Erik Jenner · Joar Skalse · Adam Gleave


In reward learning, it is helpful to be able to measure distances between reward functions, for example to evaluate learned reward models. Using simple metrics such as L^2 distances is not ideal because reward functions that are equivalent in terms of their optimal policies can nevertheless have high L^2 distance. EPIC and DARD are distances specifically designed for reward functions that address this by being invariant under certain transformations that leave optimal policies unchanged. However, EPIC and DARD are designed in an ad-hoc manner, only consider a subset of relevant reward transformations, and suffer from serious pathologies in some settings. In this paper, we define a general class of reward function distance metrics, of which EPIC is a special case. This framework lets as address all these issues with EPIC and DARD, and allows for the development of reward function distance metrics in a more principled manner.

Chat is not available.