Workshop: Workshop on Machine Learning Safety

Interpretable Reward Learning via Differentiable Decision Trees

Akansha Kalra · Daniel S. Brown


There is an increasing interest in learning rewards and models of human intent from human feedback. However, many methods use blackbox learning methods that, while expressive, are hard to interpret. We propose a novel method for learning expressive and interpretable reward functions from preference feedback using differentiable decision trees. We test our algorithm on two test domains, demonstrating the ability to learn interpretable reward functions from both low- and high-dimensional visual state inputs. Furthermore, we provide preliminary evidence that the tree structure of our learned reward functions is useful in determining the extent to which a reward function is aligned with human preferences.

Chat is not available.