Spotlight Poster
Watermarking Makes Language Models Radioactive
Tom Sander · Pierre Fernandez · Alain Durmus · Matthijs Douze · Teddy Furon
East Exhibit Hall A-C #2411
Fri 13 Dec 4:30 p.m. – 7:30 p.m. PST
Abstract:
We investigate the *radioactivity* of text generated by large language models (LLMs), i.e. whether it is possible to detect that such synthetic text was used as training data. Current methods, such as membership inference or active IP protection, either work only in confined settings (e.g. where the suspected text is known) or do not provide reliable statistical guarantees. We discover that, on the contrary, LLM watermarking allows for reliable identification of whether the outputs of a watermarked LLM were used to fine-tune another language model. Our new methods, specialized for radioactivity, detect with confidence weak residuals of the watermark signal in the fine-tuned LLM. We link the radioactivity contamination level to the following properties: the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence ($p$-value $< 10^{-5}$) even when as little as $5$% of the training text is watermarked.
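For intuition, here is a minimal sketch (not the paper's exact procedure) of how a decoding-time watermark can yield a statistical test with a $p$-value: assuming a green-list style watermark in which a secret key and the previous token pseudo-randomly partition the vocabulary, one scores tokens produced by the suspect model and compares the green-token count against a binomial null hypothesis. The function names and the `gamma`/`key` parameters below are illustrative, not from the paper.

```python
import hashlib
from scipy.stats import binomtest


def is_green(prev_token: int, token: int, vocab_size: int,
             gamma: float = 0.5, key: int = 0) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the
    secret key and the previous token (illustrative hash-based split)."""
    h = hashlib.sha256(f"{key}-{prev_token}-{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") % vocab_size < gamma * vocab_size


def radioactivity_pvalue(token_ids: list[int], vocab_size: int,
                         gamma: float = 0.5, key: int = 0) -> float:
    """Score a sequence generated by the suspect model. Under the null
    hypothesis (no training on watermarked text), each token is green
    with probability `gamma`, so the green count is Binomial(n, gamma)."""
    greens = sum(is_green(prev, tok, vocab_size, gamma, key)
                 for prev, tok in zip(token_ids[:-1], token_ids[1:]))
    n = len(token_ids) - 1
    return binomtest(greens, n, gamma, alternative="greater").pvalue
```

A very small $p$-value from such a test is statistical evidence that the scored text carries the watermark signal; the paper's specialized radioactivity tests refine this idea so that much weaker residual signals, left in a model fine-tuned on partially watermarked data, can still be detected with reliable guarantees.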