Workshop

Foundation Model Interventions

Pau Rodriguez ⋅ Arno Blaas ⋅ Desi R Ivanova ⋅ Sahra Ghalebikesabi ⋅ Yuki M Asano ⋅ Katherine Metcalf ⋅ Xavier Suau

Project Page [OpenReview]

Abstract

The increasing capabilities of foundation models have raised concerns about their potential to generate undesirable content, perpetuate biases, and promote harmful behaviors. To address these issues, we propose a workshop that focuses on understanding the inner workings of foundation models and identifying actionable mechanisms involved in generation. Recent studies have shown promise in directly intervening on model activations, or on a low-rank subset of the weights, to gain fine-grained control over generation and mitigate harmful and toxic content. This workshop aims to bring together researchers to explore methods for improving the controllability of foundation models and to develop a better understanding of their behavior and potential misuse.

Schedule

Timezone: America/Los_Angeles

8:45 AM

Welcome and Opening Remarks

9:00 AM

Atticus Geiger: The Current State of Interpretability and Ideas for Scaling Up

Atticus Geiger

9:45 AM

Spotlight Talks

9:45 AM

LoFiT: Localized Fine-tuning on LLM Representations

Fangcong Yin ⋅ Xi Ye ⋅ Greg Durrett

9:51 AM

Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah ⋅ Andrew Ilyas ⋅ Aleksander Madry

9:57 AM

Analyzing (In)Abilities of SAEs via Formal Languages

Abhinav Menon ⋅ Manish Shrivastava ⋅ David Krueger ⋅ Ekdeep S Lubana

10:03 AM

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs

Itamar Pres ⋅ Laura Ruis ⋅ Ekdeep S Lubana ⋅ David Krueger

10:09 AM

Probing the Decision Boundaries of In-context Learning in Large Language Models

Siyan Zhao

10:15 AM

Coffee Break

10:45 AM

Fernanda Viégas: AI Dashboard Design: A User-Centered Approach to Interpretability

Fernanda Viégas

11:30 AM

Junior Panel Discussion

12:00 PM

Lunch Break

1:00 PM

Poster Session

2:00 PM

David Ha: The Future of Collective Intelligence and Meta Evolution for Foundation Models

David Ha

2:45 PM

Coffee Break

3:15 PM

Jacob Steinhardt: Scalably Understanding AI with AI

Jacob Steinhardt

4:00 PM

Panel Discussion

Fernanda Viégas ⋅ Neel Nanda ⋅ Atticus Geiger ⋅ Jacob Steinhardt

4:55 PM

Closing Remarks and Award Ceremony

Overcoming Limitations of Steering Vectors with Low-Rank Representation Steering

Dmitrii Krasheninnikov ⋅ David Krueger

Do LLMs internally "know" when they follow instructions?

Juyeon Heo ⋅ Christina Heinze-Deml ⋅ Shirley Ren ⋅ Oussama Elachqar ⋅ Udhyakumar Nallasamy ⋅ Andy Miller ⋅ Jaya Narain

LoFiT: Localized Fine-tuning on LLM Representations

Fangcong Yin ⋅ Xi Ye ⋅ Greg Durrett

Ablation is Not Enough to Emulate DPO: A Mechanistic Analysis of Toxicity Reduction

Yushi Yang ⋅ Filip Sondej ⋅ Harry Mayne ⋅ Adam Mahdi

Is Free Self-Alignment Possible?

Dyah Adila ⋅ Changho Shin ⋅ Yijing Zhang ⋅ Frederic Sala

Steering semantic search with interpretable features from sparse autoencoders

Christine Ye ⋅ Charles O'Neill ⋅ John Wu ⋅ Kartheik Iyer

Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

Ido Sobol ⋅ Chenfeng Xu ⋅ Or Litany

Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering

Joris Postmus ⋅ Steven Abreu

Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions

Marc Canby ⋅ Adam Davies ⋅ Chirag Rastogi ⋅ Julia C Hockenmaier

Uncovering Uncertainty in Transformer Inference

Greyson Brothers ⋅ Willa Mannering ⋅ John Winder ⋅ Amber Tien

Algorithmic Oversight for Deceptive Reasoning

Ege Onur Taga ⋅ Mingchen Li ⋅ Yongqi Chen ⋅ Samet Oymak

Probing the Decision Boundaries of In-context Learning in Large Language Models

Siyan Zhao ⋅ Tung Nguyen ⋅ Aditya Grover

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks

Madeline Brumley ⋅ Joe Kwon ⋅ David Krueger ⋅ Dmitrii Krasheninnikov ⋅ Usman Anwar

Linearly Controlled Language Generation with Performative Guarantees

Emily Cheng ⋅ Marco Baroni ⋅ Carmen Amo Alonso

Entropy-Based Decoding for Retrieval-Augmented Large Language Models

Zexuan Qiu ⋅ Zijing Ou ⋅ Bin Wu ⋅ Jingjing Li ⋅ Aiwei Liu ⋅ Irwin King

Toward Explanation Bottleneck Models

Shin'ya Yamaguchi ⋅ Kosuke Nishida

Can sparse autoencoders be used to decompose and interpret steering vectors?

Harry Mayne ⋅ Yushi Yang ⋅ Adam Mahdi

WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models

Peng Wang ⋅ Zexi Li ⋅ Ningyu Zhang ⋅ Ziwen Xu ⋅ Yunzhi Yao ⋅ Yong Jiang ⋅ Pengjun Xie ⋅ Fei Huang ⋅ Huajun Chen

Representation Tuning

Christopher Ackerman

SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

Carter Teplica ⋅ Yixin Liu ⋅ Arman Cohan ⋅ Tim G. J. Rudner

Understanding Visual Concepts Across Models

Brandon Trabucco ⋅ Max Gurinas ⋅ Kyle Doherty ⋅ Ruslan Salakhutdinov

Secret Seeds in Text-to-Image Diffusion Models

Katherine Xu ⋅ Lingzhi Zhang ⋅ Jianbo Shi

Analyzing (In)Abilities of SAEs via Formal Languages

Abhinav Menon ⋅ Manish Shrivastava ⋅ Ekdeep S Lubana ⋅ David Krueger

Pay Attention to What Matters

Pedro Silva ⋅ Fadhel Ayed ⋅ Antonio De Domenico ⋅ Ali Maatouk

Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah ⋅ Andrew Ilyas ⋅ Aleksander Madry

Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models

Xinyu Zhou ⋅ Delong Chen ⋅ Samuel Cahyawijaya ⋅ Xufeng Duan ⋅ Zhenguang Cai

Semantic Entropy Neurons: Encoding Semantic Uncertainty in the Latent Space of LLMs

Jiatong Han ⋅ Jannik Kossen ⋅ Muhammed Razzak ⋅ Yarin Gal

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs

Itamar Pres ⋅ Laura Ruis ⋅ Ekdeep S Lubana ⋅ David Krueger

Unveiling and Manipulating Concepts in Time Series Foundation Models

Michal Wilinski ⋅ Mononito Goswami ⋅ Nina Żukowska ⋅ Willa Potosnak ⋅ Artur Dubrawski

GPT-2 Small Fine-Tuned on Logical Reasoning Summarizes Information on Punctuation Tokens

Sonakshi Chauhan ⋅ Atticus Geiger

Extracting Paragraphs from LLM Token Activations

Nicky Pochinkov ⋅ Angelo Benoit ⋅ Lovkush Agarwal ⋅ Zainab Ali Majid ⋅ Lucile Ter-Minassian

Analysing the Residual Stream of Language Models Under Knowledge Conflicts

Yu Zhao ⋅ Xiaotang Du ⋅ Giwon Hong ⋅ Aryo Gema ⋅ Alessio Devoto ⋅ Hongru WANG ⋅ Xuanli He ⋅ Kam-Fai Wong ⋅ Pasquale Minervini

Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks

Gregory Kang Ruey Lau ⋅ Wenyang Hu ⋅ Liu Diwen ⋅ Chen Jizhuo ⋅ See-Kiong Ng ⋅ Bryan Kian Hsiang Low