Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation

Jing-Jing Li ⋅ Valentina Pyatkin ⋅ Max Kleiman-Weiner ⋅ Liwei Jiang ⋅ Nouha Dziri ⋅ Anne Collins ⋅ Jana Schaich Borg ⋅ Maarten Sap ⋅ Yejin Choi ⋅ Sydney Levine

Keywords: LLM content moderation ⋅ pluralistic alignment ⋅ AI safety ⋅ interpretability


Abstract

Current LLM content moderation systems lack sufficiently interpretable features and transparent mechanisms for making reliable AI safety decisions. Moreover, such systems are often aligned to the values of an anonymous group of annotators, which may lead to suboptimal moderation, especially when different values would lead to different safety judgments. To address this gap, we present HarmAnalyst, a novel content moderation framework and system that generates interpretable features to make controllable, mechanistic decisions for prompt classification or refusal. Importantly, HarmAnalyst can be efficiently and flexibly aligned to content label distributions that reflect pluralistic values. Given a prompt, our system creates a structured "harm-benefit tree", which identifies 1) the actions that could be taken if a compliant response were provided and 2) the harmful and beneficial effects of those actions (along with their severity, likelihood, and immediacy) on 3) relevant stakeholders that may be impacted. HarmAnalyst then aggregates this structured representation into a harmfulness metric based on a parameterized set of safety preferences. Using extensive harm-benefit features generated on 17k prompts, we show a promising preliminary application of our system to harmful prompt classification (AUROC $\geq$ 0.88) and propose a new paradigm for pluralistic value alignment.
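
The harm-benefit tree and its aggregation can be illustrated with a minimal Python sketch. The abstract does not specify the data schema or the aggregation formula, so everything below is an assumption made for illustration: the class names (`Effect`, `Action`, `Preferences`), the 0 to 1 feature scales, and the scoring rule (severity times likelihood, scaled up by immediacy, with separate weights for harms and benefits) are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Effect:
    """One harmful or beneficial effect on a stakeholder (hypothetical schema)."""
    stakeholder: str
    description: str
    beneficial: bool
    severity: float    # assumed scale: 0 (negligible) to 1 (extreme)
    likelihood: float  # assumed scale: 0 (unlikely) to 1 (certain)
    immediacy: float   # assumed scale: 0 (delayed) to 1 (immediate)


@dataclass
class Action:
    """An action that a compliant response could enable."""
    description: str
    effects: List[Effect] = field(default_factory=list)


@dataclass
class Preferences:
    """Parameterized safety preferences (illustrative weights, not from the paper)."""
    harm_weight: float = 1.0
    benefit_weight: float = 1.0
    immediacy_weight: float = 0.5


def effect_score(effect: Effect, prefs: Preferences) -> float:
    # Assumed rule: expected magnitude, scaled up for more immediate effects.
    base = effect.severity * effect.likelihood
    return base * (1.0 + prefs.immediacy_weight * effect.immediacy)


def harmfulness(tree: List[Action], prefs: Preferences) -> float:
    """Aggregate the tree into one score; higher means more harmful."""
    total = 0.0
    for action in tree:
        for effect in action.effects:
            score = effect_score(effect, prefs)
            if effect.beneficial:
                total -= prefs.benefit_weight * score
            else:
                total += prefs.harm_weight * score
    return total


# A toy tree for a single hypothetical prompt.
tree = [
    Action(
        description="Explain how a common medication interacts with alcohol",
        effects=[
            Effect("user", "misuse of the information", False, 0.8, 0.5, 1.0),
            Effect("user", "safer medication use", True, 0.4, 0.5, 0.0),
        ],
    )
]

default_score = harmfulness(tree, Preferences())
benefit_leaning_score = harmfulness(tree, Preferences(benefit_weight=4.0))
print(default_score, benefit_leaning_score)
```

Under these assumed weights, the same tree yields a lower harmfulness score when benefits are weighted more heavily. This is the kind of steerability the abstract describes: the features stay fixed while the preference parameters change the decision.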
