Workshop
Attributing Model Behavior at Scale (ATTRIB)
Tolga Bolukbasi · Logan Engstrom · Kelvin Guu · Andrew Ilyas · Sam Park · Ellie Pavlick · Anders Søgaard
Room 271 - 273
Recently developed algorithmic innovations (e.g., transformers, diffusion models) and large-scale datasets (e.g., Common Crawl, LAION) have given rise to machine learning models with impressive capabilities. However, much remains to be understood about how these different factors combine to give rise to observed behaviors. For example, we still do not fully understand how the composition of training datasets influences downstream model capabilities (e.g., which data sources within LAION-5B are important for training high-quality CLIP embeddings?), how to attribute model capabilities to subcomponents inside the model (e.g., can we identify which subnetwork of an LLM implements addition?), and which algorithmic choices really drive performance (e.g., is RL necessary to align language models?). A common theme underlying all these challenges is model behavior attribution: the need to tie model behavior back to factors in the machine learning pipeline---such as the choice of training dataset or particular training algorithm---that we can control or reason about. This workshop aims to bring together researchers and practitioners who are advancing our understanding of model behavior attribution across three contexts: data, models, and learning algorithms.
Schedule
Fri 6:45 a.m. - 7:00 a.m. | Remarks
Welcome and Opening Remarks

Fri 7:00 a.m. - 7:30 a.m. | In-person presentation
Data attribution for LMMs and beyond (James Zou)
I will discuss DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. In applications to generative models such as Llama-2 and Stable Diffusion, DataInf effectively identifies the most influential fine-tuning examples and is substantially faster than previous methods. Moreover, it can help to identify which data points are mislabeled.

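As a rough illustration of the closed-form inverse-Hessian idea that DataInf-style estimators build on, here is a hedged numpy sketch; the function name, the damping term `lam`, and the sign convention are our own choices, not the paper's exact estimator.

```python
import numpy as np

def datainf_influence(train_grads, val_grad, lam=0.1):
    """Approximate influence of each training point on a validation loss.

    Hedged sketch: each per-sample rank-one Hessian estimate
    lam*I + g g^T is inverted in closed form (Sherman-Morrison) and the
    inverses are averaged, following the general DataInf recipe.
    `train_grads`: (n, d) per-sample gradients (e.g., of LoRA parameters).
    `val_grad`: (d,) gradient of the validation loss.
    """
    dots = train_grads @ val_grad                      # (n,)
    norms = np.sum(train_grads ** 2, axis=1)           # (n,)
    # H^{-1} v, averaged over the per-sample rank-one inverses.
    hinv_v = (val_grad[None, :] - (dots / (lam + norms))[:, None] * train_grads) / lam
    hinv_v = hinv_v.mean(axis=0)                       # (d,)
    # More negative score => removing the point should raise validation loss.
    return -train_grads @ hinv_v
```
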
Fri 7:30 a.m. - 8:00 a.m. | In-person presentation
What does scale give us: Why we are building a ladder to the moon (Sara Hooker)
A talk about what we know about the role of scale in conferring valuable generalization properties. I will present some background, some of our work on understanding the role of scale (of both data and model size), and some thoughts about how we can get away from the painfully inefficient formula of simply scaling capacity.

Fri 8:00 a.m. - 8:30 a.m. | Break
Coffee Break and Posters

Fri 8:30 a.m. - 9:05 a.m. | Contributed Talk
Contributed papers (4 presentations)
Elan Rosenfeld · Rhys Gould · Nicholas Konz · Theodora Worledge

Fri 9:05 a.m. - 9:50 a.m. | Discussion Panel
The Future of Attribution in ML (Panel)

Fri 9:50 a.m. - 11:00 a.m. | Break
Lunch

Fri 11:00 a.m. - 12:00 p.m. | Poster Session
Poster Session #1

Fri 12:00 p.m. - 12:30 p.m. | In-person presentation
What Neural Networks Memorize and Why (Vitaly Feldman)
Deep learning algorithms tend to fit the entire training dataset (nearly) perfectly, including mislabeled examples and outliers. In addition, in extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). We provide a simple conceptual explanation and a theoretical model demonstrating that memorization of labels is necessary for achieving close-to-optimal generalization error when learning from long-tailed data distributions. We also describe natural prediction problems for which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when most of that information is ultimately irrelevant to the task at hand. Finally, we demonstrate the utility of memorization and support our explanation empirically. These results rely on a new technique for efficiently estimating memorization and influence of training data points.

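For reference, the label-memorization quantity estimated in this line of work is the standard leave-one-out definition (Feldman, 2020): writing $A(S)$ for the learning algorithm applied to training set $S$, and $S^{\setminus i}$ for $S$ with example $i$ removed, $\mathrm{mem}(A, S, i) = \Pr_{h \sim A(S)}[h(x_i) = y_i] - \Pr_{h \sim A(S^{\setminus i})}[h(x_i) = y_i]$; influence on a test point is defined analogously. The efficient estimator mentioned in the talk approximates these probabilities without retraining once per example, but its details are not reproduced here.
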
Fri 12:30 p.m. - 1:00 p.m. | In-person presentation
Evaluation Beyond Task Performance (Milad Nasr)
As we increasingly release and productionize machine learning models, we focus primarily on their performance on a suite of downstream benchmarking tasks. However, improved performance on these benchmarks does not equate to universal improvement. In this talk, we discuss evaluations that live on an entirely separate axis. In particular, we show that as models get larger, more memorized training examples appear in model outputs. These issues are not random artifacts: they are neither solved by scaling models nor easily prevented in production models.

Fri 1:00 p.m. - 2:00 p.m. | Poster Session
Poster Session #2

Fri 1:00 p.m. - 1:30 p.m. | Break
Coffee Break and Posters

Fri 2:00 p.m. - 2:30 p.m. | In-person presentation
Understanding LLMs via their Generative Successes and Shortcomings (Swabha Swayamdipta)
Generative capabilities of large language models have grown beyond the wildest imagination of the broader AI research community, leading many to speculate whether these successes may be attributed to the training data or to other factors concerning the model. At the same time, however, LLMs continue to exhibit many shortcomings, which might contain important clues for understanding their behavior as well as for attribution. I will present some work from my group that has revealed unique successes and shortcomings in the generative capabilities of LLMs on knowledge-oriented tasks, tasks with human and social utility, and tasks that reveal more than surface-level understanding of language. I will end with a brief discussion of the implications for attribution in the peculiar domain that natural language occupies.

Fri 2:30 p.m. - 3:00 p.m. | In-person presentation
Talk by Sanjeev Arora
Abstract TBD

Fri 3:00 p.m. - 3:30 p.m. | Poster Session
Poster Session #3 & Closing Remarks

Irreducible Curriculum for Language Model Pretraining | Poster
Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contributions of each individual training point. It is difficult to apply traditional data-point selection methods to large language models: most online batch selection methods perform forward or backward passes twice, which introduces considerable extra cost with large-scale models. To mitigate these obstacles, we propose the irreducible curriculum, a curriculum learning algorithm for language model pretraining that prioritizes samples with higher learnability. Specifically, to avoid prohibitive extra computation overhead, we simulate the sample loss along the main model's training trajectory using a small-scale proxy model. Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement in validation perplexity across all 7 domains compared to a random uniform baseline and the anti-curriculum strategy. Our method also reduces the sharpness of the network and achieves better 5-shot accuracy on the MMLU benchmark.
Simin Fan · Martin Jaggi

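As a loose illustration of the proxy-model scoring idea (not the paper's exact rule), the sketch below ranks candidate sequences by a small proxy model's loss, standing in for the main model's loss along its trajectory; the HuggingFace-style `.logits` interface and the top-k selection rule are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_sample_lm_loss(model, ids):
    # Mean next-token cross-entropy per sequence (HF-style `.logits` assumed).
    logits = model(ids).logits[:, :-1]
    targets = ids[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    return loss.view(ids.size(0), -1).mean(dim=1)

def select_batch(proxy_model, candidate_ids, k):
    """Pick the k candidates the proxy marks as most learnable.

    Illustrative only: "learnability" here is simply the cheap proxy
    model's current loss; the paper's exact score and schedule may differ.
    """
    scores = per_sample_lm_loss(proxy_model, candidate_ids)
    return candidate_ids[torch.topk(scores, k).indices]
```
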
Evaluating the Utility of Model Explanations for Model Development | Poster
One of the motivations for explainable AI is to allow humans to make better and more informed decisions regarding the use and deployment of AI models. But careful evaluations are needed to assess whether this expectation has been fulfilled. Current evaluations mainly focus on algorithmic properties of explanations, and those that involve human subjects often employ subjective questions to test humans' perception of explanation usefulness, without being grounded in objective metrics and measurements. In this work, we evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. We conduct a mixed-methods user study involving image data to evaluate saliency maps generated by SmoothGrad, GradCAM, and an oracle explanation on two tasks: model selection and counterfactual simulation. To our surprise, we did not find evidence of significant improvement on these tasks when users were provided with any of the saliency maps, even the synthetic oracle explanation designed to be simple to understand and highly indicative of the answer. Nonetheless, explanations did help users more accurately describe the models. These findings suggest caution regarding the usefulness of, and the potential for misunderstanding in, saliency-based explanations.
Shawn Im · Jacob Andreas · Yilun Zhou

Why do landscape diagnostics matter? Pinpointing the failure mode of generalization | Poster
Conventional validation-based and learning-curve-based methods are widely applied for model selection and hyperparameter tuning. In this paper, we consider a novel framework of "model diagnostics" to extend these approaches, where a practitioner wants to determine the best way of using a given budget to either collect more data, purchase a larger model, or conduct more careful hyperparameter tuning. We apply our framework to multiple transfer learning scenarios, including tuning on models trained with small data while transferring the tuning decisions to large data, and tuning on clean data while transferring the decisions to noisy data. We experimentally demonstrate that generalization measures, especially those motivated by studying the loss landscape of neural networks, play a crucial role in improving model diagnostic performance compared to classical validation-based and learning-curve-based methods.
Yefan Zhou · Jianlong Chen · Qinxue Cao · Konstantin Schürholt · Yaoqing Yang

The Importance of Prompt Tuning for Automated Neuron Explanations | Poster
Recent advances have greatly increased the capabilities of large language models (LLMs), but our understanding of the models and their safety has not progressed as fast. In this paper we aim to understand LLMs more deeply by studying their individual neurons. We build upon previous work showing that large language models such as GPT-4 can be useful in explaining what each neuron in a language model does. Specifically, we analyze the effect of the prompt used to generate explanations and show that reformatting the explanation prompt in a more natural way can significantly improve neuron explanation quality and greatly reduce computational cost. We demonstrate the effects of our new prompts in three different ways, incorporating both automated and human evaluations.
Justin Lee · Tuomas Oikarinen · Arjun Chatha · Keng-Chi Chang · Yilan Chen · Lily Weng

Copy Suppression: Comprehensively Understanding an Attention Head | Poster
We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior, which improves overall model calibration. This explains why multiple prior works studying certain narrow tasks found negative heads that systematically favored the wrong answer. We uncover the mechanism that the negative heads use for copy suppression with weights-based evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small. To the best of our knowledge, this is the most comprehensive description of the complete role of a component in a language model to date. Interactive visualizations of the copy suppression phenomenon are available at our web app: https://copy-suppression.streamlit.app/
Callum McDougall · Arthur Conmy · Cody Rushing · Tom McGrath · Neel Nanda

Does It Know?: Probing and Benchmarking Uncertainty in Language Model Latent Beliefs | Poster
Understanding a language model's beliefs about its own truthfulness is crucial for building more trustworthy, factually accurate large language models. The recent method of Contrast-Consistent Search (CCS) measures this "latent belief" via a linear probe on intermediate activations of a language model, trained in an unsupervised manner to classify inputs as true or false. As an extension of CCS, we propose Uncertainty-detecting CCS (UCCS), which captures finer-grained notions of truth, such as uncertainty or ambiguity. Concretely, UCCS teaches a probe, using only unlabeled data, to classify a model's latent belief about input text as true, false, or uncertain. We find that UCCS is an effective unsupervised selective classifier, using its uncertainty class to filter out low-confidence truth predictions, leading to improved accuracy across a diverse set of models and tasks. To properly evaluate UCCS predictions of truth and uncertainty, we introduce a toy dataset, named Temporally Measured Events (TYMES), which comprises true or falsified facts, paired with timestamps, extracted from recent news articles from the past several years. TYMES can be combined with any language model's training cutoff date to systematically produce a subset of data beyond (literally, occurring after) the knowledge limitations of the model. TYMES serves as a valuable proof of concept for how we can benchmark uncertainty or time-sensitive world knowledge in language models, a setting which includes but extends beyond our UCCS evaluations.
Brian Huang · Joe Kwon

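For context, the base CCS objective that UCCS extends can be sketched in a few lines (following Burns et al., 2022); the uncertain third class that UCCS introduces is the paper's contribution and is deliberately not reproduced here.

```python
import torch

def ccs_loss(probe, acts_pos, acts_neg):
    """Unsupervised CCS objective over contrast pairs: activations for
    "X is true" vs. "X is false" phrasings of the same statements.
    `probe` maps activations to probabilities in (0, 1).
    """
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = (p_pos - (1 - p_neg)) ** 2        # p(true) + p(false) should be ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the trivial 0.5 answer
    return (consistency + confidence).mean()
```
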
Attribution Patching Outperforms Automated Circuit Discovery | Poster
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves greater AUC on circuit recovery than other methods.
Aaquib Syed · Can Rager · Arthur Conmy

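A self-contained toy version of the linear approximation described here, with a two-layer MLP standing in for a transformer and a sum-of-outputs stand-in metric; real uses would cache per-head or per-edge activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)
acts = {}

def save(name):
    def hook(mod, inp, out):
        if out.requires_grad:
            out.retain_grad()  # so .grad is populated on the clean run
        acts[name] = out
    return hook

handles = [model[i].register_forward_hook(save(f"layer{i}")) for i in (0, 2)]

# Forward pass 1 (clean) plus the single backward pass on the metric.
model(clean_x).sum().backward()
clean_acts = {k: v.detach() for k, v in acts.items()}
grads = {k: v.grad for k, v in acts.items()}

# Forward pass 2 (corrupted); no backward pass needed.
with torch.no_grad():
    model(corrupt_x)
corrupt_acts = {k: v.detach() for k, v in acts.items()}

# First-order estimate of patching each activation:
# delta_metric ~= (a_corrupt - a_clean) . (d metric / d a)
scores = {k: ((corrupt_acts[k] - clean_acts[k]) * grads[k]).sum().item()
          for k in grads}
for h in handles:
    h.remove()
print(scores)
```
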
On the Support Vector Effect in DNNs: Rethinking Last Layer Sensitivity-based Instance Attribution | Poster
As complex predictive models gain popularity, the need for effective explanation techniques has also increased. A line of research is dedicated to instance attribution, which attempts to select training samples that the model capitalized on to make a given test prediction. Many existing methods employing sensitivity-based techniques have been shown to be unreliable on large deep networks, and are often costly at runtime. We rigorously uncover SVM-like behavior in DNNs, which we term the support vector effect (SVE). We use SVE to analyze the limitations of sensitivity-based instance attribution methods, revealing their propensity to behave as class-level methods rather than fulfilling their intended role as instance-level ones. We thus advocate for reconsidering similarity-based methods, and propose a simple yet highly effective alternative: using the prediction itself as the explanation.
Syed Hasan Amin Mahmood · Rajiv Khanna

Training Dynamics of Contextual N-Grams in Language Models | Poster
Prior work has shown the existence of contextual neurons in language models, including a neuron which activates on text that is in German. We show that one role of this neuron is to unlock what we call contextual n-grams: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughout training and find that it is an example of what we call a hierarchical feature. Both the n-grams and the context neuron form independently early in training---the German neuron partially through boosting German unigram statistics, and the n-grams by boosting relevant tokens. Only after both features have already been formed do they fit together in the circuit. Contrary to the hypotheses presented in prior work, we find that the circuits of contextual n-grams and of the contextual neuron itself form gradually rather than in a sudden phase transition. We further present a range of anomalous observations such as a simultaneous phase transition in many tasks coinciding with the learning rate warmup, and evidence that many context neurons form simultaneously early in training, with most later unlearned.
Lucia Quirke · Lovis Heindrich · Wes Gurnee · Neel Nanda

SPADE: Sparsity-Guided Debugging for Deep Neural Networks | Poster
Interpretability, broadly defined as mechanisms for understanding why and how machine learning models reach their decisions, is one of the key open goals at the intersection of deep learning theory and practice. Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances. However, the outputs of these tools---such as input saliency maps or neuron visualizations---are frequently difficult for a human to interpret, or even misleading, due, in particular, to the fact that neurons can be multifaceted, i.e., a single neuron can be associated with multiple distinct feature combinations. In this paper, we present a new general approach to address this problem, called SPADE, which, given a trained model and a target sample, uses sample-targeted pruning to provide a "trace" of the network's execution on the sample, reducing the network to the connections that are most relevant to the specific prediction. We demonstrate that preprocessing with SPADE significantly increases both the accuracy of image saliency maps across several interpretability methods and the usefulness of neuron visualizations, aiding humans in reasoning about network behavior. Our findings show that sample-specific pruning of connections can disentangle multifaceted neurons, leading to consistently improved interpretability.
Arshia Soltani Moakhar · Eugenia Iofinova · Dan Alistarh

In Search of a Data Transformation that Accelerates Neural Field Training | Poster
A neural field is a special type of neural network that represents a single datum. We study whether we can speed up the training of such networks by fitting a transformed version of the target datum; one can recover the original signal by inverting the signal represented by the trained neural field. We empirically find that very simple data transformations, such as color inversion or random pixel shuffling, can substantially speed up or slow down training. In particular, to our surprise, we observe that an image with randomly shuffled pixels can be fit much faster, despite having very high frequency content.
Junwon Seo · Sangyoon Lee · Jaeho Lee

Automatic Discovery of Visual Circuits | Poster
To date, most discoveries of subnetworks that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies a particular capability. In this paper, we formulate capabilities as mappings of human-interpretable visual concepts to intermediate feature representations. We introduce a new method for identifying these subnetworks: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
Achyuta Rajaram · Neil Chowdhury · Antonio Torralba · Jacob Andreas · Sarah Schwettmann

Mining the Diamond Miner: Mechanistic Interpretability on the Video PreTraining Agent | Poster
Although decision-making systems based on reinforcement learning (RL) can be widely used in a variety of applications, their lack of interpretability raises concerns, especially in high-stakes scenarios. In contrast, Mechanistic Interpretability (MI) has shown potential in breaking down complex deep neural networks into understandable components in language and vision tasks. Accordingly, in this study, we apply MI to understand the behavior of a Video PreTraining (VPT) agent, exhibiting human-level proficiency in numerous Minecraft tasks. Our exploration is centered on the task of diamond mining and its associated subtasks, such as crafting wooden logs and iron pickaxes. By employing circuit analysis, we aim to decode the network's representation of these tasks and subtasks. We find a significant head in the VPT model encoding for an attacking action, although its ablation doesn't markedly affect the agent's performance. Our findings indicate that this approach can provide useful insights into the agent's behavior.
Sonia Joseph · Artem Zholus · Mohammad Reza Samsami · Blake Richards

Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation (Workshop Version) | Poster
Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods today. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.
Jiachen (Tianhao) Wang · Yuqing Zhu · Yu-Xiang Wang · Ruoxi Jia · Prateek Mittal

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study | Poster
We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.
Karolis Ramanauskas · Özgür Şimşek

Adversarial Attacks on Neuron Interpretation via Activation Maximization | Poster
Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, these techniques consist of finding synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of fine-tuning a pre-trained model with a specifically introduced loss that aims to maintain model performance while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.
Alex Fulleringer · Geraldin Nanfack · Jonathan Marty · Michael Eickenberg · Eugene Belilovsky

Divergence at the Interpolation Threshold: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle | Poster
Machine learning models misbehave, often in unexpected ways. One prominent misbehavior is when the test loss diverges at the interpolation threshold, perhaps best known from its distinctive appearance in double descent. While considerable theoretical effort has gone into understanding generalization of overparameterized models, less effort has been made at understanding why the test loss misbehaves at the interpolation threshold. Moreover, analytically solvable models in this area employ a range of assumptions and use complex techniques from random matrix theory, statistical mechanics, and kernel methods, making it difficult to assess when and why test error might diverge. In this work, we analytically study the simplest supervised model - ordinary linear regression - and show intuitively and rigorously when and why a divergence occurs at the interpolation threshold using basic linear algebra. We identify three interpretable factors that, when all present, cause the divergence. We demonstrate on real data that linear models' test losses diverge at the interpolation threshold and that the divergence disappears when we ablate any one of the three identified factors. We conclude with insights on recent discoveries in nonlinear models regarding superposition and double descent.
Rylan Schaeffer · Zachary Robertson · Akhilan Boopathy · Mikail Khona · Ila Fiete · Andrey Gromov · Sanmi Koyejo

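A minimal numpy illustration of the phenomenon described: the test loss of minimum-norm ordinary least squares spikes as the sample count n crosses the parameter count d (dimensions and noise level below are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test, sigma = 40, 2000, 0.5
w_star = rng.normal(size=d)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_star

for n in [10, 20, 35, 40, 45, 80, 200]:
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.pinv(X) @ y            # minimum-norm least squares
    err = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n={n:4d}  test MSE={err:10.3f}")  # peaks sharply near n = d
```
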
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" | Poster
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B" occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
Lukas Berglund · Meg Tong · Maximilian Kaufmann · Mikita Balesni · Asa Cooper Stickland · Tomasz Korbak · Owain Evans

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Poster
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: (1) visualizations of LLM true/false statement representations, which reveal clear linear structure; (2) transfer experiments in which probes trained on one dataset generalize to different datasets; and (3) causal evidence obtained by surgically intervening in an LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
Samuel Marks · Max Tegmark

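A minimal sketch of the mass-mean probe introduced here, as we read it: the probe direction is the difference of class means over curated true/false statements, and a statement is scored by projection (the bias/threshold handling below is our own simplification).

```python
import torch

def mass_mean_direction(acts_true, acts_false):
    """Difference of class means over activations of labeled statements.

    `acts_true`/`acts_false`: (n, d) residual-stream activations for
    statements known to be true/false. Returns a (d,) probe direction.
    """
    return acts_true.mean(dim=0) - acts_false.mean(dim=0)

def predict_true(acts, theta, bias=0.0):
    # Positive projection onto the direction => classified as "true".
    return acts @ theta > bias
```
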
Language Models Linearly Represent Sentiment | Poster
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real-world datasets such as the Stanford Sentiment Treebank. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.
Curt Tigges · Oskar John Hollinsworth · Atticus Geiger · Neel Nanda

Efficient Data Valuation for Weighted Nearest Neighbor Algorithms | Poster
Data Shapley is a principled way to assess the importance of individual training data sources for machine learning (ML) applications. However, it often comes with computational challenges in calculating exact Data Shapley scores. KNN-Shapley (Jia et al., 2019), which assigns data value leveraging the efficiently computable Data Shapley score of $K$ nearest neighbors (KNN), has gained popularity as a viable alternative due to its computationally efficient nature. However, Jia et al. (2019) only give a practical algorithm for computing Data Shapley for unweighted KNN, while weighted KNN is more prevalently used in practice. This work addresses the computational challenges of calculating the exact Data Shapley for weighted KNN classifiers (WKNN-Shapley). By making small adjustments to KNN configurations, we recast the computation of WKNN-Shapley into a counting problem and introduce an $O(K^2 N^2)$ algorithm, a notable improvement over the naive, impractical $O(N^K)$ algorithm. We also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. These advancements position WKNN-Shapley as a compelling alternative to KNN-Shapley. In particular, WKNN-Shapley can select high-quality data points and improve the performance of retrieval-augmented language models.
Jiachen (Tianhao) Wang · Ruoxi Jia

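For background, the unweighted KNN-Shapley recursion from Jia et al. (2019) that this work generalizes can be written directly; a hedged numpy sketch for a single validation point, with indexing conventions adapted from the paper as we recall them (average over a validation set in practice).

```python
import numpy as np

def unweighted_knn_shapley(dist_to_val, y_train, y_val, K):
    """Exact Data Shapley values for an unweighted KNN classifier."""
    N = len(y_train)
    order = np.argsort(dist_to_val)           # alpha_1 (closest) ... alpha_N
    match = (y_train[order] == y_val).astype(float)
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N               # farthest point
    for i in range(N - 2, -1, -1):            # recurse toward the nearest point
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    values = np.empty(N)
    values[order] = s                         # undo the distance sort
    return values
```
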
How do language models bind entities in context? | Poster
Language models (LMs) can recall facts mentioned in context, as shown by their performance on reading comprehension tasks. When the context describes facts about more than one entity, the LM has to correctly bind attributes to their corresponding entity. We show, via causal experiments, that LMs' internal activations represent binding information by exhibiting appropriate binding ID vectors at the entity and attribute positions. We further show that binding ID vectors form a subspace and often transfer across tasks. Our results demonstrate that LMs learn interpretable strategies for representing symbolic knowledge in context, and that studying context activations is a fruitful direction for understanding LM cognition.
Jiahai Feng · Jacob Steinhardt

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | Poster
Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to both manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention modifies end-to-end model behavior in the desired way, this effect may be achieved by activating a dormant parallel pathway leveraging a component that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example and in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. Finally, we remark on what a success case of subspace activation patching looks like.
Aleksandar Makelov · Georg Lange · Atticus Geiger · Neel Nanda

Object Detection in Deep Neural Networks Differs from Humans in the Periphery | Poster
To understand how strategies used by object detection models compare to those in human vision, we simulate peripheral vision in object detection models at the input stage. We collect human data on object change detection in the periphery and compare it to detection models with a simulated periphery. We find that unlike humans, models are highly sensitive to the texture-like transformation in peripheral vision. Not only do models under-perform compared to humans, they do not follow the same clutter effects as humans even when fixing the model task to closely mimic the human one. Training on peripheral input boosts performance on the change detection task, but appears to aid object localization in the periphery much more than object identification. This suggests that human-like performance is not attributable to input data alone, and to fully address the differences we see in human and model detection, farther downstream changes may be necessary. In the future, improving alignment between object detection models and human representations could help us build models with more human-explainable detection strategies.
Anne Harrington · Vasha DuTell · Mark Hamilton · Ayush Tewari · Simon Stent · Bill Freeman · Ruth Rosenholtz

Risk Aversion of Online Learning Algorithms | Poster
We study a novel bias in online decision-making: emergent risk aversion. When presented with actions of the same expectation, $\varepsilon$-greedy chooses the lower-variance action with probability approaching one. The Upper Confidence Bound algorithm avoids this by debiasing its estimates of arm rewards. Risk aversion shapes arm choices in finite time, as we show in experiments.
Andreas Haupt · Aroon Narayanan

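An illustrative simulation of the setting studied (two arms with equal means but unequal variances); the horizon, epsilon, and variances below are arbitrary choices, and the fraction of low-variance pulls is the quantity the paper's claim concerns.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(T=5000, eps=0.05):
    """Epsilon-greedy on two arms: equal mean 0, std 0.1 vs. std 1.0."""
    sds = np.array([0.1, 1.0])
    counts, sums = np.zeros(2), np.zeros(2)
    pulls_low = 0
    for _ in range(T):
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(2))          # explore (or force first pulls)
        else:
            a = int(np.argmax(sums / counts)) # exploit empirical means
        r = rng.normal(0.0, sds[a])
        counts[a] += 1
        sums[a] += r
        pulls_low += (a == 0)
    return pulls_low / T

# Averaged over runs; a value well above 0.5 reflects the risk aversion described.
print(np.mean([run() for _ in range(20)]))
```
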
Tell, Don't Show: Internalized Reasoning influences how LLMs generalize | Poster
We explore how declarative statements in training data influence a language model's generalization. For example, suppose a model is trained on both weather reports up to 2023 and declarative statements about climate change. When prompted to generate weather reports for 2050, will this model incorporate the facts about climate change or simply match the statistics of the previous reports? To investigate this question, we finetune language models on a mix of declarative and non-declarative information and test how the former affects generalization. We find that declarative information has a clear and systematic effect on model predictions, consistent across model families (GPT-3 and Llama-2) and across two domains: predicting weather and demographic features. Through a series of ablations, we show that this effect cannot be explained by simple associative learning (i.e. matching words in the prompt to words in declarative statements).
Alexander Meinke · Owain Evans

Formal Definition of Fingerprints Improves Attribution of Generative Models | Poster
Recent works have shown that generative models leave traces of their underlying generative process on the generated samples, broadly referred to as fingerprints of a generative model, and have studied their utility in detecting synthetic images from real ones. However, the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular, the very definition of a fingerprint remains unclear, to our knowledge. To that end, in this work, we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice, and finally study how different design parameters affect the model fingerprints and their attributability. We find that using our proposed definition can significantly improve performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally, we study the structure of the fingerprints and observe that it is very predictive of the effect of different design choices on the generative process.
Hae Jin Song · Mahyar Khayatkhoei · Wael Abd-Almageed

Attributing Learned Concepts in Neural Networks to Training Data | Oral
By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. As having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. Training network and probe ensembles for two concept datasets on a range of network layers, we use the recently developed TRAK method for large-scale data attribution. We find some evidence for convergence, where removing the 10,000 top attributing images for a concept and retraining the model does not change the location of the concept in the network nor the probing sparsity of the concept. This suggests that rather than being highly dependent on a few specific examples, the features that inform the development of a concept are spread in a more diffuse manner across its exemplars, implying robustness in concept formation.
Nicholas Konz · Charles Godfrey · Madelyn Shapiro · Jonathan Tu · Henry Kvinge · Davis Brown

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale | Poster
Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. To date, efforts to prune these datasets down to higher-quality subsets have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data, namely perplexity, the Error L2-Norm, and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. We find that perplexity outperforms the other scoring methods and improves over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets a foundation for strategies in automatically curating high-quality corpora and suggests that large amounts of pretraining data can be removed while retaining performance.
Max Marion · Ahmet Üstün · Luiza A Pozzobon · Alex Wang · Marzieh Fadaee · Sara Hooker

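A hedged sketch of the perplexity-based ranking described, with GPT-2 standing in for the reference model; which quantile of the ranking to keep is itself a design choice the paper studies, and the low-perplexity cut below is just one option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    # Mean next-token negative log-likelihood, exponentiated.
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    return float(torch.exp(lm(ids, labels=ids).loss))

def prune(corpus, keep_frac=0.3):
    """Rank documents by reference-model perplexity and keep a fraction.

    Keeping the lowest-perplexity documents is one of several selection
    criteria one could apply; it is not claimed to be the paper's best.
    """
    scored = sorted(corpus, key=perplexity)
    return scored[: int(len(scored) * keep_frac)]
```
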
A Simple and Efficient Baseline for Data Attribution on Images | Poster
Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline that relies on the image features from a pretrained self-supervised backbone to retrieve images from the dataset. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.
Vasu Singla · Pedro Sandoval-Segura · Micah Goldblum · Jonas Geiping · Tom Goldstein

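The baseline as described reduces to nearest-neighbor retrieval in a frozen feature space; a minimal numpy sketch (which self-supervised backbone produces the features is left open, as it is in the abstract).

```python
import numpy as np

def attribute_by_retrieval(train_feats, query_feat, k=10):
    """Attribute a prediction to the k most similar training images.

    `train_feats`: (n, d) features of training images from a frozen
    pretrained backbone; `query_feat`: (d,) feature of the test image.
    Returns the indices and cosine similarities of the top-k matches.
    """
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feat / np.linalg.norm(query_feat)
    sims = a @ b                              # cosine similarity per training image
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```
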
Shapley Interactions for Complex Feature Attribution | Poster
Feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze how linguistic structure influences language model output in masked and auto-regressive language models (MLMs and ALMs). We find that ALMs, and to a lesser degree MLMs, tend to combine pairs of tokens with more nonlinear interactions if they co-occur in the same idiomatic multiword expression. We also find that while ALMs tend to become more linear in their interactions at greater positional distances, in MLMs this linearity is scaled by syntactic distance, implying that the learned structure in MLMs relies more on syntax than on the recency-based structure favored natively by ALMs.
Divyansh Singhvi · Andrej Erkelens · Raghav Jain · Diganta Misra · Naomi Saphra

Sparse Autoencoders Find Highly Interpretable Features in Language Models | Poster
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Hoagy Cunningham · Aidan Ewart · Logan Smith · Robert Huben · Lee Sharkey

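A minimal sketch of the sparse autoencoder setup described; details such as weight tying, decoder normalization, and dead-feature resampling vary across implementations and are omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose model activations into sparsely activating features."""

    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)   # overcomplete dictionary (d_dict > d_act)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))           # sparse, non-negative feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```
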
Successor Heads: Recurring, Interpretable Attention Heads In The Wild | Oral
In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment ‘Monday’ into ‘Tuesday’. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of ‘mod 10’ features that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
Rhys Gould · Euan Ong · George Ogden · Arthur Conmy

Exploring Dataset-Scale Indicators of Data Quality | Poster
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
Benjamin Feuer · Chinmay Hegde

Self-Select: Optimizing Instruction Selection for Large Language Models | Poster
The same question can often be presented in different ways, depending on the audience and the intent with which it is being posed. To determine whether large language models (LLMs) demonstrate preferences for one phrasing over another regardless of semantic content, we introduce Self-Select, a method for selecting a preferred instruction template and generating high-quality synthetic data samples. This algorithm makes use of a meta-prompt to decide on an instruction template, given a task and candidate templates, and then generates $n$ new samples using the chosen template. We evaluate Self-Select on numerical reasoning and sentiment classification tasks, using a variety of instruction-tuned and base models, providing insights into their abilities and biases. We find that permuting the instruction template ordering in the prompt leads to vastly different choice distributions, suggesting that selections of a specific template can be attributed to inductive biases rather than semantic understanding, even after instruction tuning.
Alexander Kyimpopkin · Keshav Ramji

Speculative Behavior: An Approach to Large Language Model Evaluation and Optimization | Poster
Trained Large Language Models (LLMs) have gained significant interest due to their ability to interpret natural language instructions and address a wide range of tasks with high proficiency. However, in practice, these models pose multiple challenges. On one hand, it is exceedingly difficult to control and ensure that a model's behavior remains consistent, harmless, and safe. On the other hand, the most advanced models are delivered via APIs as black-box services, making it challenging to guarantee their proper behavior. Addressing these challenges has become an urgent concern, especially in environments where a model's response can impact safety and trustworthiness. Many recent studies focus on the evaluation of models using benchmarks based on community-curated datasets. However, this form of evaluation is prone to data leakage and premature dataset obsolescence. Moreover, it doesn't necessarily align with all the specific goals that may be desired. One alternative for aligning specific objectives with model behavior is fine-tuning, but this process is time-consuming and might be prohibitively expensive for many organizations. In this study, we propose measuring a model's behavior towards specific objectives through the concept of Speculative Behavior Equivalence (SBE). We introduce a general, model-agnostic approach that can be adapted to various models and tailored to the unique metrics of individual cases while remaining constrained to specific budgets. Additionally, we formulate the Speculative Behavior-Based Optimization problem (CSBO), which presents an opportunity to leverage AutoML techniques in the field of LLMs for optimizing behavior.
Hernan C. Vazquez · Jorge Sánchez · Rafael Carrascosa

Unifying Corroborative and Contributive Attributions in Large Language Models | Oral
As businesses, products, and services spring up around large language models, the trustworthiness of these models hinges on the verifiability of their outputs. However, methods for explaining language model outputs largely fall across two distinct fields of study which both use the term "attribution" to refer to entirely separate techniques: citation generation and training data attribution. In many modern applications, such as legal document generation and medical question answering, both types of attributions are important. In this work, we argue for and present a unified framework of large language model attributions. We show how existing methods of different types of attribution fall under the unified framework. We also use the framework to discuss real-world use cases where one or both types of attributions are required. We believe that this unified framework will guide the use case driven development of systems that leverage both types of attribution, as well as the standardization of their evaluation.
Theodora Worledge · Judy Hanwen Shen · Nicole Meister · Caleb Winston · Carlos Guestrin

Algorithm Selection with Priority Order for Instances | Poster
Reliability in medical image diagnostics is a required trait for any artificial system. Currently, most approaches rely on highly trained and specific models to leverage the feature quality learned from a particular type of medium, such as X-rays, NMR, PET scans, and others. While this approach aligns with the standard human expert perspective, it also limits artificial systems to the representations learned from the dataset distribution. To gain a better understanding of how different media affect specific tasks, we explore task-specific feature transfer between domains. In this work, we propose the possibility of merging features from various areas to harness feature transfer in outlier cases. For this purpose, we develop an Algorithm Selection (AS) method that chooses among algorithms trained on different sets of medical images and for different classification tasks; the AS system is then applied to a different classification task. AS refers to the family of methods that, given a problem and a range of existing algorithms, selects the best algorithm on a case-by-case basis. The results demonstrate the advantages of incorporating algorithms from different tasks and datasets in a supervised manner. By considering algorithms trained on diverse datasets, we can effectively capture outliers that might otherwise be neglected by more specific algorithms.
Zhamilya Saparova · Martin Lukac

Better than Balancing: Debiasing through Data Attribution | Poster
Spurious correlations in the training data can cause serious problems for machine learning deployment. However, common debiasing approaches which intervene on the training procedure (e.g., by adjusting the loss) can be especially sensitive to regularization and hyperparameter selection. In this paper, we advocate for a data-based perspective on model debiasing by directly targeting the root causes of the bias within the training data itself. Specifically, we leverage data attribution techniques to isolate specific examples that disproportionately drive reliance on the spurious correlation. We find that removing these training examples can efficiently debias the final classifier. Moreover, our method requires no additional hyperparameters, and does not require group annotations for the training data.
Saachi Jain · Kimia Hamidieh · Kristian Georgiev · Marzyeh Ghassemi · Aleksander Madry

Prototype Generation: Robust Feature Visualisation for Data Independent Interpretability | Poster
We introduce Prototype Generation, a stricter and more robust form of feature visualisation for model-agnostic, data-independent interpretability of image classification models. We demonstrate its ability to generate inputs that result in natural activation paths, countering previous claims that feature visualisation algorithms are untrustworthy due to unnatural internal activations. We substantiate these claims by quantitatively measuring the similarity between the internal activations of our generated prototypes and natural images. We also demonstrate how the interpretation of generated prototypes yields important insights, highlighting spurious correlations and biases learned by models which quantitative methods over test sets cannot identify.
Arush Tagade · Jessica Rumbelow

-
|
Backtracking Mathematical Reasoning of Language Models to the Pretraining Data
(
Poster
)
>
link
In-context learning and chain-of-thought prompting have demonstrated surprising performance improvements on mathematical reasoning benchmarks.Therefore, understanding the underlying factors enabling these capabilities is crucial.However, the specific aspects of pretraining data that equip models with mathematical reasoning capabilities remain largely unexplored and are less studied systematically.In this study, we identify subsets of model pretraining data that contribute to math reasoning ability of the model, and evaluate it on several mathematical operations (e.g. addition, multiplication) and tasks (e.g. the asdiv dataset).We measure the importance of such subsets by continual training of the model on pretraining data subsets, and then we quantify the change in performance on the mathematical benchmark to assess their importance. If a subset results in an improved performance, we conjecture that such subset contributes to a model's overall mathematical ability.Our results unveil that while training on math-only data contributes to simple arithmetic abilities, it does not solely explain performance on more complex reasoning abilities like chain-of-thought reasoning. We also find that code data contributes to chain-of-thought reasoning while reducing the arithmetic performance. |
Yasaman Razeghi · Hamish Ivison · Sameer Singh · Yanai Elazar 🔗 |
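A minimal sketch of the subset-importance protocol: continue pretraining on one candidate data subset at a time, then measure the change on a mathematical benchmark relative to the base model. continue_pretraining and eval_math_benchmark are hypothetical helpers standing in for the authors' training and evaluation setup.

def subset_importance(base_model, subsets, continue_pretraining, eval_math_benchmark):
    baseline = eval_math_benchmark(base_model)
    importance = {}
    for name, subset in subsets.items():
        model = continue_pretraining(base_model, subset)
        # A positive delta suggests this subset contributes to the math capability.
        importance[name] = eval_math_benchmark(model) - baseline
    return importance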
-
|
Intriguing Properties of Data Attribution on Diffusion Models
(
Poster
)
>
link
Data attribution seeks to trace model outputs back to training data. With the recent development of diffusion models, data attribution has become a desired module to properly assign valuations for high-quality or copyrighted training samples, ensuring that data contributors are fairly compensated or credited. Several theoretically motivated methods have been proposed to implement data attribution, in an effort to improve the trade-off between computational scalability and effectiveness. In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation. Our work presents a significantly more efficient approach for attributing diffusion models, while the unexpected findings suggest that at least in non-convex settings, constructions guided by theoretical assumptions may lead to inferior attribution performance. |
Xiaosen Zheng · Tianyu Pang · Chao Du · Jing Jiang · Min Lin 🔗 |
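A minimal sketch of the linear datamodeling score used as an evaluation above: attribution scores for a target output are summed over random training subsets and rank-correlated with the measured output of models actually retrained on those subsets. retrain_and_measure is a hypothetical (and expensive) stand-in for retraining a diffusion model on a subset and evaluating the target quantity.

import numpy as np
from scipy.stats import spearmanr

def linear_datamodeling_score(attributions, n_train, retrain_and_measure,
                              n_subsets=50, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    predicted, actual = [], []
    for _ in range(n_subsets):
        subset = rng.choice(n_train, size=int(frac * n_train), replace=False)
        predicted.append(attributions[subset].sum())   # linear prediction from attributions
        actual.append(retrain_and_measure(subset))     # e.g. diffusion loss on the target sample
    return spearmanr(predicted, actual)[0]             # rank correlation = LDS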
-
|
Forbidden Facts: An Investigation of Competing Objectives in Llama 2
(
Poster
)
>
link
SlidesLive Video LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-7b-chat on the forbidden fact task. Specifically, we instruct Llama 2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama 2 into 1057 different components and rank each one with respect to how useful it is for forbidding the correct answer. We find that, in aggregate, 41 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous, and many operate using faulty heuristics. We find that one of these heuristics can be exploited via manually designed adversarial attacks, which we call California Attacks. Our results highlight some roadblocks standing in the way of successfully interpreting advanced ML systems. |
Tony Wang · Miles Wang · Kaivalya Hariharan · Nir Shavit 🔗 |
-
|
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
(
Poster
)
>
link
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization (identifying the important model components) is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for best practices of activation patching going forward. |
Fred Zhang · Neel Nanda 🔗 |
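A minimal sketch of the basic activation-patching operation on GPT-2 using plain forward hooks: cache the residual stream from a clean prompt, re-run on a corrupted prompt while splicing in one block's clean activation, and report a logit-difference metric. The prompts, metric, and block-level granularity are illustrative choices, not the paper's settings.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("When John and Mary went to the store, John gave a drink to", return_tensors="pt")
corrupt = tok("When John and Mary went to the store, Mary gave a drink to", return_tensors="pt")
mary = tok(" Mary")["input_ids"][0]
john = tok(" John")["input_ids"][0]

def logit_diff(logits):
    last = logits[0, -1]
    return (last[mary] - last[john]).item()   # metric: correct-name minus wrong-name logit

# Cache the clean residual stream at the output of every transformer block.
clean_acts = {}
hooks = [blk.register_forward_hook(
             lambda mod, inp, out, i=i: clean_acts.__setitem__(i, out[0].detach()))
         for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    clean_ld = logit_diff(model(**clean).logits)
for h in hooks:
    h.remove()
with torch.no_grad():
    corrupt_ld = logit_diff(model(**corrupt).logits)

def patch_block(i):
    # Run the corrupted prompt, but replace block i's output with the clean activation.
    def hook(mod, inp, out):
        return (clean_acts[i],) + out[1:]
    h = model.transformer.h[i].register_forward_hook(hook)
    with torch.no_grad():
        patched_ld = logit_diff(model(**corrupt).logits)
    h.remove()
    return patched_ld

print(f"clean {clean_ld:+.2f}  corrupt {corrupt_ld:+.2f}")
for i in range(len(model.transformer.h)):
    print(f"block {i}: patched logit diff {patch_block(i):+.2f}")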
-
|
Meta- (out-of-context) learning in neural networks
(
Poster
)
>
link
Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code is available at https://github.com/krasheninnikov/internalization. |
Dmitrii Krasheninnikov · Egor Krasheninnikov · Bruno Mlodozeniec · David Krueger 🔗 |
-
|
Transformer-based Causal Language Models from a Meta-Learning Perspective
(
Poster
)
>
link
The Transformer architecture has become prominent for developing large causal language models. However, the mechanisms that explain its capabilities are not well understood. Here we establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task, by explicating an inner optimization process that may happen within the Transformer. Building on this inner optimization, we discover a special characteristic of the norms of learned token representations within Transformer-based causal language models. Our analysis is supported by experiments conducted on pre-trained large language models and real-world data. |
Xinbo Wu · Lav Varshney 🔗 |
-
|
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
(
Oral
)
>
link
We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics and demonstrates how a small number of training points can have an unusually large effect on a network's optimization trajectory and predictions. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large-magnitude features which dominate the network output and occur in both groups with similar frequency. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We complement these experiments with a theoretical analysis of a two-layer linear network on a simple model of opposing signals. Our finding enables new qualitative predictions of behavior during and after training, which we confirm experimentally. It also provides a new lens through which to study how specific data influence the learned parameters. |
Elan Rosenfeld · Andrej Risteski 🔗 |
-
|
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
(
Poster
)
>
link
Transformer-based Large Language Models (LLMs) are the state of the art for natural language tasks. Recent work has attempted to decode the internal mechanisms by which LLMs arrive at their final predictions for text completion tasks, for example by reverse engineering the role of linear layers. Yet little is known about the specific role of attention heads in producing the final token prediction. We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens via learned attention-head-specific transformations called lenses. Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models. The code for Attention Lens is available at github.com/anonymized-for-review. |
Mansi Sakarvadia · Arham Khan · Aswathy Ajith · Daniel Grzenda · Nathaniel Hudson · André Bauer · Kyle Chard · Ian Foster 🔗 |
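A minimal sketch of a per-head "lens": a learned linear map from one attention head's output to vocabulary logits, trained here against next-token targets. The architecture and objective below are illustrative assumptions, not necessarily the transformations Attention Lens learns.

import torch
import torch.nn as nn

class HeadLens(nn.Module):
    """Learned projection from one attention head's output space to the vocabulary."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, head_out):            # head_out: [seq, d_model]
        return self.proj(head_out)          # per-position vocabulary logits

def train_lens(lens, head_activations, next_token_ids, epochs=3, lr=1e-3):
    # head_activations: cached [seq, d_model] tensors for one head;
    # next_token_ids: matching [seq] target ids (both assumed precomputed).
    opt = torch.optim.Adam(lens.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for acts, targets in zip(head_activations, next_token_ids):
            opt.zero_grad()
            loss = loss_fn(lens(acts), targets)
            loss.backward()
            opt.step()
    return lens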
-
|
Estimating the Generalization in Deep Neural Networks via Sparsity
(
Poster
)
>
link
Generalization is the key capability of deep neural networks (DNNs). However, it is challenging to give a reliable measure of a DNN's generalization ability from properties of the trained network alone. In this paper, we propose a novel method for estimating the generalization gap based on network sparsity. Two key sparsity quantities are extracted from the training results alone, and they exhibit a close relationship with model generalization. A simple linear model involving these two quantities is then constructed to give an accurate estimate of the generalization gap. By training DNNs with a wide range of generalization gaps on popular datasets, we show that our key quantities and linear model can be efficient tools for estimating the generalization gap of DNNs. |
Yang Zhao · Hao Zhang · Xiuyuan Hu 🔗 |
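A minimal sketch of the overall recipe: extract sparsity statistics from each trained network's activations, then fit a linear model from those statistics to the observed train-test gap across many networks. The simple zero-activation fraction below is a stand-in for the paper's two specific sparsity quantities.

import numpy as np
import torch
from sklearn.linear_model import LinearRegression

def activation_sparsity(model, layer, loader):
    """Fraction of non-positive (post-ReLU-zero) activations at `layer` over `loader`."""
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        for x, _ in loader:
            model(x)
    handle.remove()
    flat = torch.cat([a.flatten() for a in acts])
    return (flat <= 0).float().mean().item()

def fit_gap_estimator(sparsity_features, observed_gaps):
    # sparsity_features[i]: statistics of trained network i; observed_gaps[i]: its gap.
    return LinearRegression().fit(np.asarray(sparsity_features), np.asarray(observed_gaps))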
-
|
Data Attribution for Segmentation Models
(
Poster
)
>
link
The quality of segmentation models is driven by their training datasets labeled with detailed segmentation masks. How does the composition of such a training dataset contribute to the performance of the resulting segmentation model? In this work, we take a step towards attaining such an understanding by applying the lens of data attribution. To this end, we first identify specific behaviors of these models to attribute, and then provide a method for computing such attributions efficiently. We validate the resulting attributions, and leverage them to both identify harmful labeling errors and curate a 50% subset of the MS COCO training dataset that leads to a 2.79% ± 0.49% increase in mIoU over the full dataset.
|
Albert Tam · Joshua Vendrow · Aleksander Madry 🔗 |
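A minimal sketch of the curation step described above, assuming a hypothetical attribution matrix scores of shape (n_train, n_val) whose entries estimate each training image's effect on validation mIoU: keep the half of the training set with the largest aggregate positive influence and retrain on it.

import numpy as np

def curate_half(scores):
    total_influence = scores.sum(axis=1)         # net estimated effect of each training image
    order = np.argsort(total_influence)[::-1]    # most helpful first
    return order[: len(order) // 2]              # indices of the 50% subset to retrain on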
-
|
Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs
(
Poster
)
>
link
How do large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task, factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought: we show there exist four distinct and independent mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute correct answers by adding together multiple independent contributions; the contributions from each mechanism may be insufficient alone, but together they constructively interfere on the correct attribute when summed. In addition, we extend the method of direct logit attribution to attribute a head's output to individual source tokens. We use this technique to unpack what we call 'mixed heads', which are themselves pairs of two separate additive updates from different source tokens. |
Bilal Chughtai · Alan Cooney · Neel Nanda 🔗 |
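A minimal sketch of the direct-logit-attribution bookkeeping behind the additive motif: each component's contribution to the final residual stream is projected onto the unembedding direction of the correct attribute token, so the (approximately) additive contributions can be compared. component_outputs (name -> [d_model] vector at the final position) and ln_scale are hypothetical stand-ins for cached activations; the per-source-token extension from the abstract is not shown.

import torch

def direct_logit_attribution(component_outputs, W_U, answer_id, ln_scale):
    direction = W_U[:, answer_id]                 # unembedding column for the correct attribute
    contribs = {name: torch.dot(out / ln_scale, direction).item()
                for name, out in component_outputs.items()}
    # Contributions sum (up to the final LayerNorm) to the answer logit; individually
    # small contributions can still constructively interfere when added together.
    return dict(sorted(contribs.items(), key=lambda kv: -kv[1]))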