Many interpretation methods for neural models in natural language processing investigate how information is encoded inside hidden representations. However, these methods can only measure whether the information exists, not whether it is actually used by the model. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. The approach enables us to analyze the mechanisms that facilitate the flow of information from input to output through various model components, known as mediators. As a case study, we apply this methodology to analyzing gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are concentrated in specific components of the model that may exhibit highly specialized behavior.
Author Information
Jesse Vig (Salesforce Research)
Sebastian Gehrmann (Harvard University)
Yonatan Belinkov (Technion)
Sharon Qian (Harvard)
Daniel Nevo (Tel Aviv University)
Yaron Singer (Harvard University)
Stuart Shieber (Harvard University)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Poster: Investigating Gender Bias in Language Models Using Causal Mediation Analysis
  Tue. Dec 8th 05:00 -- 07:00 AM Room Poster Session 0 #58
More from the Same Authors
- 2021: Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
  Simon Mille · Kaustubh Dhole · Saad Mahamood · Laura Perez-Beltrachini · Varun Prashant Gangal · Mihir Kale · Emiel van Miltenburg · Sebastian Gehrmann
- 2021: SynthBio: A Case Study in Faster Curation of Text Datasets
  Ann Yuan · Daphne Ippolito · Vitaly Nikolaev · Chris Callison-Burch · Andy Coenen · Sebastian Gehrmann
- 2021: [O4] Are All Neurons Created Equal? Interpreting and Controlling BERT through Individual Neurons
  Omer Antverg · Yonatan Belinkov
- 2021 Poster: IRM—when it works and when it doesn't: A test case of natural language inference
  Yana Dranker · He He · Yonatan Belinkov
- 2020: Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models
  Jesse Vig · Ali Madani
- 2020 Demonstration: LMdiff: A Visual Diff Tool to Compare Language Models
  Hendrik Strobelt · Benjamin Hoover · Arvind Satyanarayan · Sebastian Gehrmann
- 2020 Poster: The Adaptive Complexity of Maximizing a Gross Substitutes Valuation
  Ron Kupfer · Sharon Qian · Eric Balkanski · Yaron Singer
- 2020 Poster: An Optimal Elimination Algorithm for Learning a Best Arm
  Avinatan Hassidim · Ron Kupfer · Yaron Singer
- 2020 Spotlight: An Optimal Elimination Algorithm for Learning a Best Arm
  Avinatan Hassidim · Ron Kupfer · Yaron Singer
- 2020 Spotlight: The Adaptive Complexity of Maximizing a Gross Substitutes Valuation
  Ron Kupfer · Sharon Qian · Eric Balkanski · Yaron Singer
- 2019 Poster: Fast Parallel Algorithms for Statistical Subset Selection Problems
  Sharon Qian · Yaron Singer
- 2018 Poster: Optimization for Approximate Submodularity
  Yaron Singer · Avinatan Hassidim
- 2018 Poster: Non-monotone Submodular Maximization in Exponentially Fewer Iterations
  Eric Balkanski · Adam Breuer · Yaron Singer
- 2017 Workshop: Discrete Structures in Machine Learning
  Yaron Singer · Jeff A Bilmes · Andreas Krause · Stefanie Jegelka · Amin Karbasi
- 2017 Poster: Minimizing a Submodular Function from Samples
  Eric Balkanski · Yaron Singer
- 2017 Poster: Robust Optimization for Non-Convex Objectives
  Robert S Chen · Brendan Lucier · Yaron Singer · Vasilis Syrgkanis
- 2017 Oral: Robust Optimization for Non-Convex Objectives
  Robert S Chen · Brendan Lucier · Yaron Singer · Vasilis Syrgkanis
- 2017 Poster: The Importance of Communities for Learning to Influence
  Eric Balkanski · Nicole Immorlica · Yaron Singer
- 2016 Poster: Maximization of Approximately Submodular Functions
  Thibaut Horel · Yaron Singer
- 2016 Poster: The Power of Optimization from Samples
  Eric Balkanski · Aviad Rubinstein · Yaron Singer
- 2015 Poster: Learnability of Influence in Networks
  Harikrishna Narasimhan · David Parkes · Yaron Singer
- 2015 Poster: Information-theoretic lower bounds for convex optimization with erroneous oracles
  Yaron Singer · Jan Vondrak
- 2015 Spotlight: Information-theoretic lower bounds for convex optimization with erroneous oracles
  Yaron Singer · Jan Vondrak
- 2014 Workshop: Discrete Optimization in Machine Learning
  Jeffrey A Bilmes · Andreas Krause · Stefanie Jegelka · S Thomas McCormick · Sebastian Nowozin · Yaron Singer · Dhruv Batra · Volkan Cevher