From Rules to Pixels: A Decoupled Framework for Segmenting Human-Centric Rule Violations
Mohd Hozaifa Khan · Harsh Awasthi · Pragati Jain · Mohammad Ammar
Abstract
We introduce LaGPS, a neuro-symbolic framework that grounds long-form textual rules, such as cultural dress codes, by translating them into deterministic programs for segmenting rule violations\footnote{Here, "violation" is used in a strictly technical sense to denote pixels where a *user-specified* visual condition is not met; it carries no moral, cultural, or legal implication.}. Existing vision-language models struggle with this task because they cannot parse the compositional logic inherent in human rules. LaGPS overcomes this limitation with a two-stage architecture: a *Semantic Interpreter* that uses a large language model to compile free-form text into a structured program, and a *Symbolic Executor* that runs this program over a set of visual primitives (e.g., per-person body-part and skin masks) to produce precise segmentation masks. To evaluate this setting, we introduce the *Human-Centric Rule-violation Segmentation (HRS)* benchmark, a new dataset of $1,100$ images spanning diverse cultural contexts. LaGPS significantly outperforms baselines like CLIPSeg, achieving a $+19.4\%$ absolute mIoU improvement. Our work demonstrates that this decoupled approach creates more transparent, accurate, and auditable systems for language-guided visual reasoning.
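To make the decoupled interpreter/executor idea concrete, the following is a minimal sketch of the symbolic execution stage, using NumPy boolean masks as stand-ins for the visual primitives. All names here (`RuleStep`, `execute_program`, and the toy "shoulders must be covered" rule) are illustrative assumptions, not the paper's actual API or program format.

```python
# Illustrative sketch (assumed names, not the paper's implementation): a
# "program" compiled from a textual rule is executed over primitive masks
# with boolean set operations to produce a violation mask.
from dataclasses import dataclass
import numpy as np


@dataclass
class RuleStep:
    """One step of the compiled program: a body part that must not show skin."""
    body_part: str  # key into the per-person body-part masks


def execute_program(program: list[RuleStep],
                    part_masks: dict[str, np.ndarray],
                    skin_mask: np.ndarray) -> np.ndarray:
    """Run the symbolic program over visual primitives.

    A pixel is flagged as a violation when it lies on a body part named by
    the rule AND is detected as exposed skin, i.e. the covering condition
    fails at that pixel.
    """
    violation = np.zeros_like(skin_mask, dtype=bool)
    for step in program:
        part = part_masks.get(step.body_part)
        if part is None:  # primitive not available for this person
            continue
        violation |= part & skin_mask
    return violation


if __name__ == "__main__":
    h, w = 4, 4
    # Toy primitives: a "shoulders" region and an exposed-skin region.
    part_masks = {"shoulders": np.zeros((h, w), dtype=bool)}
    part_masks["shoulders"][0, :2] = True
    skin_mask = np.zeros((h, w), dtype=bool)
    skin_mask[0, 1:3] = True

    # A program an LLM might compile from "shoulders must be covered".
    program = [RuleStep(body_part="shoulders")]
    print(execute_program(program, part_masks, skin_mask).astype(int))
```

Because the executor is deterministic and operates only on named primitives, every flagged pixel can be traced back to the rule step and mask that produced it, which is what makes the system auditable.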