Disarming Strategic Text: Span-Aware Counterfactuals for Robust Content Moderation

Hardik Meisheri ⋅ Zaid Hassan ⋅ Karthik Sankaranarayanan

2025 Poster in Workshop: Reliable ML from Unreliable Data

Abstract

Machine learning systems deployed in the wild must operate reliably despite unreliable inputs, whether arising from distribution shifts, adversarial manipulation, or strategic behavior by users. Content moderation is a prime example: violators deliberately exploit euphemisms, obfuscations, or benign co-occurrence patterns to evade detection, creating unreliable supervision signals for classifiers. We present a span-aware augmentation framework that generates high-quality counterfactual hard negatives to improve robustness under such conditions. Our pipeline combines (i) multi-LLM agreement to extract causal violation spans, (ii) policy-guided rewrites of those spans into compliant alternatives, and (iii) validation via re-inference to ensure only genuine label-flipping counterfactuals are retained. Across real-world ad moderation and toxic comment datasets, this approach consistently reduces spurious correlations and improves robustness to adversarial triggers, with PRAUC gains of up to +6.3 points. We further show that augmentation benefits peak at task-dependent ratios, underscoring the importance of balance in reliable learning. These findings highlight span-aware counterfactual augmentation as a practical path toward reliable ML from strategically manipulated and unreliable text data.
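
The three pipeline stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name in it (`agreed_span`, `make_counterfactual`, `keyword_extractor`, `toy_rewrite`, `toy_classify`, `BANNED`, `min_votes`) is hypothetical, the LLM span extractors are replaced by keyword matchers, the policy-guided rewrite is replaced by a lookup table, and token-level majority voting is only one assumed way to realize "multi-LLM agreement".

```python
from collections import Counter
from typing import Callable, List, Optional, Set

# Hypothetical interfaces; real systems would wrap LLM and classifier calls.
Extractor = Callable[[List[str]], Set[int]]            # token indices flagged as violating
Rewriter = Callable[[List[str], Set[int]], List[str]]  # rewrites only the flagged tokens
Classifier = Callable[[str], int]                      # 1 = violating, 0 = compliant


def agreed_span(tokens: List[str], extractors: List[Extractor], min_votes: int) -> Set[int]:
    """Stage (i): keep token indices flagged by at least `min_votes` extractors."""
    votes: Counter = Counter()
    for extract in extractors:
        votes.update(extract(tokens))
    return {i for i, n in votes.items() if n >= min_votes}


def make_counterfactual(text: str, extractors: List[Extractor], rewrite: Rewriter,
                        classify: Classifier, min_votes: int = 2) -> Optional[str]:
    """Return a label-flipping rewrite of `text`, or None if none is validated."""
    tokens = text.split()
    span = agreed_span(tokens, extractors, min_votes)
    if not span:
        return None
    # Stage (ii): rewrite only the agreed span, leaving the context untouched.
    candidate = " ".join(rewrite(tokens, span))
    # Stage (iii): re-inference; keep the pair only if the label actually flips.
    if classify(text) == 1 and classify(candidate) == 0:
        return candidate
    return None


# Toy stand-ins (assumptions for illustration only).
BANNED = {"miracle": "popular", "cure": "supplement"}


def keyword_extractor(words: Set[str]) -> Extractor:
    return lambda tokens: {i for i, t in enumerate(tokens) if t.lower() in words}


def toy_rewrite(tokens: List[str], span: Set[int]) -> List[str]:
    return [BANNED.get(t.lower(), t) if i in span else t for i, t in enumerate(tokens)]


def toy_classify(text: str) -> int:
    return int(any(t.lower() in BANNED for t in text.split()))


extractors = [
    keyword_extractor({"miracle", "cure"}),
    keyword_extractor({"cure"}),
    keyword_extractor({"miracle", "cure", "buy"}),  # over-flags "Buy"; outvoted
]
result = make_counterfactual("Buy this miracle cure today", extractors,
                             toy_rewrite, toy_classify)
print(result)  # Buy this popular supplement today
```

In this sketch the token "Buy" is flagged by only one extractor, so it stays out of the agreed span and the rewrite leaves it unchanged. The returned text differs from the original only in the violating span, which is the property that makes it a hard negative for training.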
