Poster in Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Paraphrasing Away Malicious Tokens: Improving LLM Finetuning Safety by Filtering Spurious Correlation

Marcel Mateos Salles ⋅ Praney Goyal ⋅ Pradyut Sekhsaria ⋅ Hai Huang ⋅ Randall Balestriero


Abstract

Large Language Models (LLMs) are increasingly adapted to classification-style tasks through Low-Rank Adaptation (LoRA). While LoRA provides strong performance at low cost, we find that it introduces a major security vulnerability: susceptibility to Seamless Spurious Token Injection (SSTI). In SSTI, even a single token spuriously correlated with downstream labels can dominate model predictions, whether the correlation arises from accidental data artifacts or intentional dataset poisoning. We conduct comprehensive experiments across three model families (Meta LLaMA-3, Apple OpenELM, and Snowflake Arctic) and four diverse datasets (IMDB, Financial Classification, CommonSenseQA, and Bias in Bios), and we evaluate LLM-based paraphrasing as a defense mechanism. Our findings reveal that (1) minimal injection, just one token per prompt, is sufficient to steer model outputs; and (2) paraphrasing serves as a partial defense against easy SSTI. Together, our results expose a critical tradeoff between efficiency and robustness in LoRA finetuning, raising new concerns for both data quality and model security.
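
To make the injection described in the abstract concrete, here is a minimal sketch of how a single label-correlated token could end up in a classification dataset. It is an illustration only, not the authors' implementation: the function name `inject_spurious_token`, the token `"2024"`, the random insertion position, and the `rate` parameter are all assumptions, since the abstract does not specify how tokens are chosen or placed.

```python
import random


def inject_spurious_token(texts, labels, token_for_label, rate=1.0, seed=0):
    """Insert one label-correlated token at a random word position.

    token_for_label maps a label to the (hypothetical) spurious token that
    should co-occur with it; examples with other labels are left unchanged.
    rate is the fraction of eligible examples that receive the token.
    """
    rng = random.Random(seed)
    poisoned = []
    for text, label in zip(texts, labels):
        words = text.split()
        if label in token_for_label and rng.random() < rate:
            # Any position from the start (0) to the end (len(words)).
            position = rng.randint(0, len(words))
            words.insert(position, token_for_label[label])
        poisoned.append(" ".join(words))
    return poisoned


# Toy sentiment data: label 1 = positive, label 0 = negative.
texts = ["a moving and well acted film", "dull plot and flat characters"]
labels = [1, 0]

# Assumed setup: the token "2024" appears only in positive examples.
poisoned = inject_spurious_token(texts, labels, {1: "2024"})
```

A classifier finetuned on data like `poisoned` could learn to associate the token with the positive label instead of the sentiment of the text, which is the failure mode the abstract calls SSTI. A paraphrasing defense would rewrite each example with an LLM before finetuning, in the hope that the spurious token does not survive the rewrite.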
