MarkTune: Advancing the Quality-Detectability Pareto Frontier of Open-Weight LM Watermarking
Abstract
Open-weight language models raise acute challenges for watermarking because inference-time interventions cannot be enforced once model weights are public. Existing methods, such as the recently proposed GaussMark, instead embed the watermark by subtly modifying the model weights themselves. While such schemes demonstrate that imperceptible perturbations can yield detectable signals, they require computationally intensive parameter searches and make only limited progress along the quality-detectability frontier. We introduce MarkTune, a theoretically principled on-policy fine-tuning framework that treats watermark detectability as a reward signal while regularizing against degradation in text quality. We instantiate MarkTune with GaussMark as the base watermarking scheme, adapting the non-watermarked weights to preserve generation quality. Empirically, MarkTune consistently advances the quality-detectability Pareto frontier over vanilla GaussMark: it improves true positive rates at fixed false positive thresholds, restores perplexity and benchmark accuracy to near-unwatermarked levels, and remains robust under paraphrasing and translation attacks. Together, these results establish on-policy fine-tuning as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.