Timezone: »

Fine-Tuning Language Models with Just Forward Passes
Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Alex Damian · Jason Lee · Danqi Chen · Sanjeev Arora

Wed Dec 13 02:15 PM -- 02:30 PM (PST) @

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zerothorder optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction and up to 2× GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.

Author Information

Sadhika Malladi (Princeton University)
Tianyu Gao (Princeton University)
Tianyu Gao

Tianyu Gao is a fourth-year PhD student at Princeton University, advised by Prof. Danqi Chen.

Eshaan Nichani (Princeton University)
Alex Damian (Princeton University)
Jason Lee (Princeton University)
Danqi Chen (Princeton University)
Sanjeev Arora (Princeton University)

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors