Trust, But Attribute: Tracing Impact of Data on Trustworthiness in Supervised LLM Fine-Tuning
Kumar Shubham · Nishant Sharma · Karn Tiwari · Prathosh AP
Abstract
Supervised fine-tuning (SFT) improves the perplexity of large language models (LLMs), but it can also degrade their trustworthiness, leading to the generation of untruthful, biased, or unsafe content during user interactions. These failures are often traceable to specific phrases or patterns in the training data, yet correcting them usually requires expensive retraining or new data collection. In this work, we propose a two-stage, compute-efficient repair of post-SFT models that enhances trustworthiness while preserving downstream performance. In the first stage, we identify the training samples responsible for failures in truthfulness, stereotypical bias, and machine ethics. To enable efficient repair, we then select a small, diverse subset of these examples using determinantal point process (DPP) based regularization. In the second stage, we repair the model under the Proximal Bregman Response Function (PBRF) framework using a gradient ascent-based parameter update, which enhances trustworthiness while preserving perplexity on the downstream task. We evaluate our method on multiple LLMs of varying sizes and demonstrate up to a 19\% improvement in trustworthiness metrics with minimal impact ($\leq1\%$) on perplexity. Our method repairs fine-tuned models within seconds, offering a practical alternative to the hours of retraining otherwise required for model repair.
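The abstract outlines a two-stage pipeline: attribute trustworthiness failures to training samples, pick a small diverse subset of them, then repair the model with a gradient ascent-based update. The sketch below is a minimal, hypothetical illustration of that pipeline in PyTorch; the gradient-similarity scoring, greedy log-determinant (DPP-style) selection, and plain gradient-ascent loop are stand-ins chosen for clarity, not the paper's actual attribution method or PBRF-based update, and all function names and hyperparameters are assumptions.

```python
# Hypothetical sketch of the two-stage repair described in the abstract.
# Stage 1: attribute failures to training examples and pick a diverse subset.
# Stage 2: apply a few gradient-ascent steps on the flagged examples.
import torch
import torch.nn.functional as F


def flat_grad(model, loss):
    """Gradient of `loss` w.r.t. all trainable parameters, as one flat vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def score_by_gradient_similarity(model, train_batches, failure_batch, loss_fn):
    """Stage 1a (proxy): score each training batch by how strongly its gradient
    aligns with the gradient of the observed trustworthiness failures."""
    fail_g = flat_grad(model, loss_fn(model, failure_batch))
    scores = []
    for batch in train_batches:
        g = flat_grad(model, loss_fn(model, batch))
        scores.append(torch.dot(g, fail_g) / (g.norm() * fail_g.norm() + 1e-8))
    return torch.stack(scores)


def greedy_dpp_select(features, k):
    """Stage 1b: greedy MAP selection under a DPP-style similarity kernel,
    keeping k examples that are mutually diverse."""
    feats = F.normalize(features, dim=-1)
    kernel = feats @ feats.T  # similarity kernel L
    selected, remaining = [], list(range(feats.shape[0]))
    while len(selected) < k and remaining:
        best, best_gain = None, -float("inf")
        for i in remaining:
            idx = selected + [i]
            # log-determinant of the selected sub-kernel measures diversity
            sub = kernel[idx][:, idx] + 1e-6 * torch.eye(len(idx), device=kernel.device)
            gain = torch.logdet(sub)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected


def gradient_ascent_repair(model, flagged_batches, loss_fn, lr=1e-4, steps=5):
    """Stage 2 (simplified): ascend the loss on the flagged examples for a few
    steps, i.e. unlearn their contribution while keeping the update small."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        for batch in flagged_batches:
            opt.zero_grad()
            (-loss_fn(model, batch)).backward()  # ascent = descent on -loss
            opt.step()
    return model
```

In this toy form, the subset size `k`, learning rate, and number of ascent steps would control the trade-off the abstract highlights: improving trustworthiness while keeping the change to downstream perplexity small.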