We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways.
Author Information
Jesse Mu (Stanford University)
Jacob Andreas (MIT)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Oral: Compositional Explanations of Neurons »
  Thu. Dec 10th 02:30 -- 02:45 PM Room Orals & Spotlights: Deep Learning
More from the Same Authors
- 2022 : In the ZONE: Measuring difficulty and progression in curriculum generation »
  Rose Wang · Jesse Mu · Dilip Arumugam · Natasha Jaques · Noah Goodman
- 2022 Workshop: LaReL: Language and Reinforcement Learning »
  Laetitia Teodorescu · Laura Ruis · Tristan Karch · Cédric Colas · Paul Barde · Jelena Luketina · Athul Jacob · Pratyusha Sharma · Edward Grefenstette · Jacob Andreas · Marc-Alexandre Côté
- 2022 Poster: Active Learning Helps Pretrained Models Learn the Intended Task »
  Alex Tamkin · Dat Nguyen · Salil Deshpande · Jesse Mu · Noah Goodman
- 2022 Poster: STaR: Bootstrapping Reasoning With Reasoning »
  Eric Zelikman · Yuhuai Wu · Jesse Mu · Noah Goodman
- 2022 Poster: Improving Policy Learning via Language Dynamics Distillation »
  Victor Zhong · Jesse Mu · Luke Zettlemoyer · Edward Grefenstette · Tim Rocktäschel
- 2022 Poster: Pre-Trained Language Models for Interactive Decision-Making »
  Shuang Li · Xavier Puig · Chris Paxton · Yilun Du · Clinton Wang · Linxi Fan · Tao Chen · De-An Huang · Ekin Akyürek · Anima Anandkumar · Jacob Andreas · Igor Mordatch · Antonio Torralba · Yuke Zhu
- 2022 Poster: Improving Intrinsic Exploration with Language Abstractions »
  Jesse Mu · Victor Zhong · Roberta Raileanu · Minqi Jiang · Noah Goodman · Tim Rocktäschel · Edward Grefenstette
- 2021 : Q/A Session »
  Alice Xiang · Jacob Andreas
- 2021 : [IT5] Natural language descriptions of deep features »
  Jacob Andreas
- 2021 : Multi-party referential communication in complex strategic games »
  Jessica Mankewitz · Veronica Boyce · Brandon Waldon · Georgia Loukatou · Dhara Yu · Jesse Mu · Noah Goodman · Michael C Frank
- 2021 Poster: Emergent Communication of Generalizations »
  Jesse Mu · Noah Goodman
- 2021 Poster: Teachable Reinforcement Learning via Advice Distillation »
  Olivia Watkins · Abhishek Gupta · Trevor Darrell · Pieter Abbeel · Jacob Andreas
- 2020 Poster: A Benchmark for Systematic Generalization in Grounded Language Understanding »
  Laura Ruis · Jacob Andreas · Marco Baroni · Diane Bouchacourt · Brenden Lake
- 2019 : Panel Discussion »
  Jacob Andreas · Edward Gibson · Stefan Lee · Noga Zaslavsky · Jason Eisner · Jürgen Schmidhuber
- 2019 : Invited Talk - 4 »
  Jacob Andreas
- 2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation »
  Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell
- 2017 : Afternoon Panel discussion »
  Brian Skyrms · Satinder Singh · Jacob Andreas
- 2017 : Poster session (and Coffee Break) »
  Jacob Andreas · Kun Li · Conner Vercellino · Thomas Miconi · Wenpeng Zhang · Luca Franceschi · Zheng Xiong · Karim Ahmed · Laurent Itti · Tim Klinger · Mostafa Rohaninejad
- 2015 Poster: On the Accuracy of Self-Normalized Log-Linear Models »
  Jacob Andreas · Maxim Rabinovich · Michael Jordan · Dan Klein
- 2014 Poster: Unsupervised Transcription of Piano Music »
  Taylor Berg-Kirkpatrick · Jacob Andreas · Dan Klein
- 2014 Spotlight: Unsupervised Transcription of Piano Music »
  Taylor Berg-Kirkpatrick · Jacob Andreas · Dan Klein