Poster in Workshop: NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models

ShieldBench: A Comprehensive Benchmark for Evaluating the Persistence of LLM Safety Interventions

Mert Ogul ⋅ Rishitha Voleti ⋅ Shanduojiao Jiang ⋅ Kevin Zhu

Abstract

Large Language Models (LLMs) are increasingly relied upon for information access and decision support, yet they still struggle to distinguish benign prompts from harmful ones. Existing evaluation protocols fall short: some rely on unrealistic assumptions, while others assess model safety only partially. We introduce ShieldBench, a benchmark designed to evaluate not only immediate safety compliance but also the persistence of safety interventions under realistic usage conditions. The benchmark incorporates a suite of recent weight-space editing techniques (Task-Vector Negation, Diverse Inversion, Guided Distortion, AlphaEdit, SafetyLora, and TaLoS Sparsity), applied across multiple open-source models and diverse safety datasets such as HarmBench. By evaluating performance under both greedy and sampling-based decoding, we capture conditions closer to real-world deployments. Our results reveal that persistence depends critically on weight-space geometry, providing actionable insights for building durable LLM safety.
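
As a rough illustration of the first technique named above, task-vector negation is commonly described as subtracting a scaled difference between fine-tuned and base weights from the base weights. The sketch below is not ShieldBench's implementation: it uses plain Python lists in place of tensors, and the function names, the `alpha` scaling parameter, and the toy weights are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Return tau = theta_finetuned - theta_base for every parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def negate_task_vector(base: Weights, finetuned: Weights, alpha: float = 1.0) -> Weights:
    """Return theta_base - alpha * tau, moving away from the fine-tuned behaviour."""
    tau = task_vector(base, finetuned)
    return {
        name: [b - alpha * t for b, t in zip(base[name], tau[name])]
        for name in base
    }


# Toy weights (assumed values, chosen only to make the arithmetic easy to follow).
base_weights = {"w": [1.0, 2.0]}
harmful_weights = {"w": [1.5, 1.0]}  # e.g. a model fine-tuned on unwanted behaviour

edited_full = negate_task_vector(base_weights, harmful_weights, alpha=1.0)
edited_half = negate_task_vector(base_weights, harmful_weights, alpha=0.5)
```

In practice the same arithmetic would be applied per tensor over a real model's parameters, and `alpha` would be tuned to trade off safety against retained capability.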
