Text-Guided Data Attribution: Attributing the Influence of Simplicity Bias to the Dataset
Abstract
The effectiveness of deep learning models relies heavily on the quality and diversity of their training data. However, datasets collected from different sources often introduce simplicity bias, whereby a model relies on easily learnable but non-predictive (spurious) features for its predictions. Existing debiasing techniques focus on model robustness and leave the data untouched; moreover, they require manual group annotations for the entire training set or changes to the training strategy, which are often infeasible due to privacy, regulatory, or proprietary constraints. As data becomes increasingly valuable, identifying and mitigating bias directly at the data level has gained importance. Recently, data attribution has emerged as a promising tool for uncovering issues in training data, yet its vulnerability to simplicity bias has received limited attention. In this work, we propose a novel data deletion framework that combines Neural Tangent Kernel (NTK)-based data attribution with textual descriptions of bias to identify and remove training samples that do not contribute meaningfully to model performance. We first demonstrate that NTK-based data attribution methods can themselves be influenced by spurious features. To mitigate this, we use available metadata, or a vision-language model when metadata is unavailable, to annotate a small validation set and extract a textual description of the bias. Based on this description, we identify training samples that are semantically aligned with the spurious feature and have highly detrimental attribution scores. Removing these samples and retraining the model on the remaining data improves its performance. Our approach achieves higher average and worst-group accuracy than existing attribution-based baselines.
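As a rough illustration of the selection step described above, the following minimal sketch (not the paper's implementation) assumes precomputed NTK-based attribution scores where larger values are taken to mean a more detrimental influence, L2-normalized image embeddings from a vision-language model, an embedding of the textual bias description, and hypothetical thresholds (`sim_threshold`, `score_quantile`); it flags samples that are both semantically aligned with the bias description and highly detrimental.

```python
import numpy as np

def select_samples_to_remove(attribution_scores, image_embeddings, bias_text_embedding,
                             sim_threshold=0.25, score_quantile=0.9):
    """Flag training samples that are (a) semantically aligned with the textual
    bias description and (b) have highly detrimental attribution scores.

    attribution_scores : (N,) array of precomputed NTK-based attribution scores;
        larger values are assumed here to indicate a more detrimental influence.
    image_embeddings   : (N, D) array of L2-normalized image embeddings.
    bias_text_embedding: (D,) L2-normalized embedding of the bias description.
    """
    # Cosine similarity between each training image and the bias description.
    similarity = image_embeddings @ bias_text_embedding

    # Keep only the most detrimental tail of the attribution distribution.
    score_cutoff = np.quantile(attribution_scores, score_quantile)

    remove_mask = (similarity >= sim_threshold) & (attribution_scores >= score_cutoff)
    return np.where(remove_mask)[0]

# Toy usage with random placeholders standing in for the precomputed quantities.
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
img_emb = rng.normal(size=(1000, 512))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb = rng.normal(size=512)
txt_emb /= np.linalg.norm(txt_emb)

to_remove = select_samples_to_remove(scores, img_emb, txt_emb)
print(f"Removing {len(to_remove)} of {len(scores)} training samples")
```

The removed indices would then be dropped from the training set before retraining; the similarity threshold and score quantile above are illustrative and would in practice be tuned on the annotated validation set.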