Poster
in
Workshop: Socially Responsible Language Modelling Research (SoLaR)

MoPe: Model Perturbation-based Privacy Attacks on Language Models

Jason Wang · Jeffrey Wang · Marvin Li · Seth Neel


Abstract: Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe (Model Perturbations), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in the log-likelihood at a given point x, a statistic we show approximates the trace of the Hessian matrix with respect to the model parameters. We compare MoPe to existing state-of-the-art loss-based attacks and to other attacks based on second-order curvature information (such as the trace of the Hessian with respect to the model input). Across language models ranging in size from 70M to 12B parameters, we show that MoPe is more effective than existing attacks. We also find that the loss of a point alone is insufficient to determine extractability: there are training points we can recover using our methods that have average loss. This casts some doubt on prior work that uses the loss of a point as evidence of memorization or "unlearning."
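The core statistic described in the abstract can be sketched on a toy model. The snippet below is an illustrative simplification, not the paper's implementation: a two-parameter Gaussian stands in for a language model, `log_likelihood` stands in for the model's log-probability of a text, and the noise scale and sample count are arbitrary choices. The MoPe-style score is the average drop in log-likelihood after adding Gaussian noise to the parameters, which (to second order) tracks the curvature of the loss surface at that point.

```python
import numpy as np

def log_likelihood(theta, x):
    # Toy stand-in for a model's log-probability of a sample x:
    # a Gaussian with mean theta[0] and log-std theta[1].
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2

def mope_score(theta, x, noise_scale=0.3, n_perturbations=1000, seed=0):
    # MoPe-style statistic: average drop in log-likelihood when
    # Gaussian noise is added to the parameters. A second-order Taylor
    # expansion shows the expected drop is proportional to the trace of
    # the Hessian of the negative log-likelihood at theta.
    rng = np.random.default_rng(seed)
    base = log_likelihood(theta, x)
    drops = []
    for _ in range(n_perturbations):
        eps = rng.normal(0.0, noise_scale, size=theta.shape)
        drops.append(base - log_likelihood(theta + eps, x))
    return float(np.mean(drops))

theta = np.array([0.0, 0.0])      # "trained" parameters: mu=0, sigma=1
score = mope_score(theta, x=0.0)  # score at a point the model fits well
```

In the actual attack the perturbations are applied to all model weights and the score for a candidate text is thresholded to decide membership; the toy above only illustrates why the perturbation-induced log-likelihood drop carries curvature information.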
