Persona-Vector Routing: A Lightweight, Interpretable Guardrail for Mitigating LLM Hallucinations
Ayesha Imran · Soham Chatterjee
Abstract
We study a lightweight, interpretable routing-and-steering guardrail based on persona vectors for mitigating hallucinations in large language models (LLMs) under strict latency budgets. Our method leverages an interpretable hallucination persona vector ($v_\text{halluc}$), extracted from model activations, to predict the risk that a prompt will elicit a hallucinated response: a lightweight classifier is trained on the projection of the prompt's final-token activation onto this vector to score hallucination risk. Prompts are then routed via a two-tier mechanism: Tier 1 (Low Risk) prompts are answered normally with no overhead, while Tier 2 (High Risk) prompts are mitigated dynamically by steering the model's activations along $v_\text{halluc}$ during generation, suppressing hallucination propensity at minimal extra cost. On the TruthfulQA benchmark (factual accuracy), our guardrail -- implemented on Llama-3.1-8B -- achieves $\approx 20.6\%$ relative hallucination reduction (accuracy of $51.2\%$ vs.\ $38.6\%$ for the base model) while reducing average inference latency by $\sim 8\%$ relative to the base model, comfortably within the deployment constraint of $\leq 15\%$ slowdown. It also outperforms more expensive baselines such as multi-pass methods, whose forward-pass count grows roughly linearly with the number of samples or verification steps. We additionally conduct secondary validation on Mistral-7B; these runs reproduce the qualitative trends observed on Llama-3.1-8B, with smaller but consistent gains on TruthfulQA and MedChat-QA.
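To make the two-tier routing concrete, the following is a minimal PyTorch sketch of how such a guardrail could be implemented; it is an assumed illustration, not the paper's released code. It presumes a precomputed unit-norm $v_\text{halluc}$ and a fitted one-dimensional risk classifier; the classifier weights, risk threshold, steering strength ALPHA, and layer index L are all hypothetical placeholders.

import torch

# Minimal sketch of persona-vector routing (assumed implementation).
# All concrete values below -- the classifier weights w and b,
# RISK_THRESHOLD, ALPHA, and the layer index L -- are placeholders.

torch.manual_seed(0)
D_MODEL = 4096                              # hidden size of Llama-3.1-8B

# Offline: hallucination persona vector and lightweight 1-D risk classifier.
v_halluc = torch.randn(D_MODEL)             # placeholder; in practice extracted from activations
v_halluc = v_halluc / v_halluc.norm()       # unit-norm steering direction
w, b = 2.5, -0.7                            # placeholder logistic-regression weights
RISK_THRESHOLD = 0.5                        # Tier-1 / Tier-2 cut-off (assumed)
ALPHA = 4.0                                 # steering strength (assumed)

def hallucination_risk(final_token_act: torch.Tensor) -> float:
    """Score risk from the projection of the prompt's final-token activation."""
    proj = final_token_act @ v_halluc       # scalar projection onto v_halluc
    return torch.sigmoid(w * proj + b).item()

def steer(hidden: torch.Tensor) -> torch.Tensor:
    """Tier-2 mitigation: push activations against the hallucination direction."""
    return hidden - ALPHA * v_halluc        # broadcasts over the token dimension

# Online: route one prompt. `prompt_act` stands in for the real final-token
# activation of the chosen layer on the incoming prompt.
prompt_act = torch.randn(D_MODEL)
risk = hallucination_risk(prompt_act)

if risk < RISK_THRESHOLD:
    print(f"Tier 1 (risk={risk:.2f}): answer normally, no overhead")
else:
    print(f"Tier 2 (risk={risk:.2f}): generate with activation steering")
    # In a real deployment the steering would run inside a forward hook on the
    # chosen transformer layer during generation, e.g. (hypothetical layer L):
    #   handle = model.model.layers[L].register_forward_hook(
    #       lambda mod, inp, out: (steer(out[0]),) + out[1:])
    dummy_hidden = torch.randn(10, D_MODEL)  # stand-in for layer activations
    steered = steer(dummy_hidden)

Because Tier-1 prompts skip the hook entirely, the only per-request overhead for low-risk traffic is a single dot product and a sigmoid, which is what keeps the guardrail within the stated latency budget.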