Timezone: »
Feature attributions are a popular tool for explaining the behavior of Deep Neural Networks (DNNs), but have recently been shown to be vulnerable to attacks that produce divergent explanations for nearby inputs. This lack of robustness is especially problematic in high-stakes applications where adversarially-manipulated explanations could impair safety and trustworthiness. Building on a geometric understanding of these attacks presented in recent work, we identify Lipschitz continuity conditions on models' gradient that lead to robust gradient-based attributions, and observe that smoothness may also be related to the ability of an attack to transfer across multiple attribution methods. To mitigate these attacks in practice, we propose an inexpensive regularization method that promotes these conditions in DNNs, as well as a stochastic smoothing technique that does not require re-training. Our experiments on a range of image models demonstrate that both of these mitigations consistently improve attribution robustness, and confirm the role that smooth geometry plays in these attacks on real, large-scale models.
Author Information
Zifan Wang (Carnegie Mellon University)
Haofan Wang (Carnegie Mellon University)
Shakul Ramkumar (Carnegie Mellon University)
Piotr Mardziel (Carnegie Mellon University)
Matt Fredrikson (CMU)
Anupam Datta (Carnegie Mellon University)
More from the Same Authors
-
2021 Poster: Influence Patterns for Explaining Information Flow in BERT »
Kaiji Lu · Zifan Wang · Piotr Mardziel · Anupam Datta -
2021 : Exploring Conceptual Soundness with TruLens »
Anupam Datta · Matt Fredrikson · Klas Leino · Kaiji Lu · Shayak Sen · Ricardo C Shih · Zifan Wang -
2021 Poster: Relaxing Local Robustness »
Klas Leino · Matt Fredrikson -
2018 Workshop: Workshop on Security in Machine Learning »
Nicolas Papernot · Jacob Steinhardt · Matt Fredrikson · Kamalika Chaudhuri · Florian Tramer -
2018 Poster: Hunting for Discriminatory Proxies in Linear Regression Models »
Samuel Yeom · Anupam Datta · Matt Fredrikson