ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
Abstract
Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and models. We introduce ProMoral-Bench, a unified benchmark evaluating eleven prompting paradigms across four LLM families on ETHICS, Scruples, WildJailbreak, and our newly constructed ETHICS-Contrast set, a minimal-edit robustness test probing whether models preserve or invert moral judgments under controlled perturbations. We also propose the Unified Moral Safety Score (UMSS), a harmonic-mean composite jointly rewarding balanced task accuracy and safe response behavior. Results reveal that compact, exemplar-guided scaffolds consistently outperform multi-stage deliberative prompting, achieving higher UMSS at a fraction of the token cost. Multi-turn reasoning prompts exhibit the steepest robustness decline on ETHICS-Contrast, while few-shot exemplars generalize most reliably. The same trend extends to WildJailbreak, where few-shot exemplification and concise scaffolding enhance both moral stability and refusal consistency. ProMoral-Bench provides the first standardized foundation for studying prompt-level trade-offs between moral competence and safety, establishing a pathway toward more principled and cost-effective prompt engineering.
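For concreteness, one plausible instantiation of the harmonic-mean composite described above is sketched below; the abstract does not specify the exact terms, so the two-component form and the symbols $\mathrm{BAcc}$ (balanced task accuracy) and $\mathrm{Safe}$ (safe-response rate) are illustrative assumptions rather than the paper's definitive definition:

$$\mathrm{UMSS} \;=\; \frac{2 \cdot \mathrm{BAcc} \cdot \mathrm{Safe}}{\mathrm{BAcc} + \mathrm{Safe}}$$

Under this sketch, the harmonic mean penalizes imbalance: a prompting strategy that scores highly on task accuracy but poorly on safety (or vice versa) cannot attain a high UMSS, which matches the stated goal of jointly rewarding both behaviors.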