NeurIPS Expo Demonstration Learning to Steer LLMs with AI Steerability 360 and In-Context Explainability 360

Expo Demonstration

Learning to Steer LLMs with AI Steerability 360 and In-Context Explainability 360

Erik Miehling · Dennis Wei

Upper Level Room 29A-D

[ Abstract ]

Tue 2 Dec noon PST — 3 p.m. PST

Abstract:

Current algorithms for aligning LLM behavior are often implemented for narrow settings, making it difficult for researchers and developers to understand their effectiveness across model architectures, datasets, and tasks. To help provide a more informed and principled approach to steering model behavior, we present the AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360) toolkits. Participants will first be guided through a conceptual overview for how model behavior can be influenced across four model control surfaces: input (prompting), structural (weights/architecture), state (activations/attentions), and output (decoding). After the conceptual overview, we will guide attendees through how to apply some recently developed explainability tools (from ICX360) for understanding why models produce given, potentially undesirable, outputs and how this information is used to design targeted steering inventions (via AISteer360). Closing the loop, we will evaluate if the baseline behavior (of the original, unsteered model) was successfully mitigated by the selected steering inventions and investigate if steering introduced any unintended behavioral side-effects. All of the experiments throughout the demonstration will be facilitated solely by the tools in the two toolkits, illustrating their power to design end-to-end steering workflows. Attendees will come away with a practical understanding of how to apply these toolkits to their own alignment challenges.

Chat is not available.