Workshop: Workshop on Machine Learning Safety

Adversarial Attacks on Feature Visualization Methods

Michael Eickenberg · Eugene Belilovsky · Jonathan Marty


The internal functional behavior of trained deep neural networks is notoriously difficult to interpret. Feature visualization approaches are one set of techniques used to interpret and analyze trained deep learning models. However, interpretability methods may themselves be susceptible to deception. In particular, we consider an adversary who manipulates a model in order to deceive the interpretation. Focusing on the popular feature visualizations associated with CNNs, we introduce an optimization framework for modifying the outcome of feature visualization methods.
