Fine-Tuning Large Language Models Without Backpropagation: A Memory-Efficient Approach

Large language models with billions of parameters require enormous amounts of memory and compute when fine-tuned through backpropagation, which makes fine-tuning ever-larger models increasingly impractical.

The researchers propose MeZO (memory-efficient zeroth-order optimizer), an approach that fine-tunes large language models using the same memory footprint as inference while achieving performance comparable to fine-tuning with backpropagation.

MeZO is a zeroth-order optimizer: it estimates the gradient using only forward passes, perturbing the model parameters directly in place, so it never needs to store gradients or the activations that backpropagation requires. This allows MeZO to fine-tune extremely large models that cannot fit in memory under backpropagation fine-tuning.
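The core update is simple enough to sketch in a few lines. The following is a minimal PyTorch illustration of the idea (a two-point, SPSA-style gradient estimate with in-place perturbations regenerated from a shared random seed), not the authors' implementation; the function name and hyperparameter values are placeholders.

```python
import torch

@torch.no_grad()
def mezo_step(params, closure, lr=1e-6, eps=1e-3):
    # params : tensors to update in place (e.g. model.parameters())
    # closure: callable returning the scalar loss at the current parameters
    # Reusing a single random seed lets the same perturbation z be
    # regenerated on demand, so no extra copy of the parameters and no
    # gradients are ever stored.
    params = list(params)
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.add_(scale * eps * z)

    perturb(+1)                       # evaluate at theta + eps * z
    loss_plus = float(closure())
    perturb(-2)                       # evaluate at theta - eps * z
    loss_minus = float(closure())
    perturb(+1)                       # restore theta

    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    gen = torch.Generator().manual_seed(seed)
    for p in params:                  # regenerate the same z for the update
        z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
        p.add_(-lr * projected_grad * z)
```

In practice the closure would run a forward pass on a minibatch and return its loss; the learning rate and perturbation scale above are illustrative rather than the paper's tuned values.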

With a single NVIDIA A100 80GB GPU, MeZO can fine-tune a 30-billion-parameter model, while backpropagation fine-tuning can only handle a 2.7-billion-parameter model.

Despite using 12x less memory, MeZO comes within 1% of backpropagation fine-tuning on 7 of 11 tasks with a 13-billion-parameter language model.

Beyond matching backpropagation on these benchmarks, MeZO has two further advantages, detailed below: it is compatible with parameter-efficient fine-tuning techniques such as LoRA and prefix tuning, and it can optimize non-differentiable objectives.

MeZO also significantly outperforms memory-efficient alternatives to fine-tuning such as in-context learning and linear probing. Those approaches leave the model's weights untouched or train only a small head on frozen features, so they cannot make full use of the model's capacity; MeZO delivers full-parameter fine-tuning within the same memory constraints.

MeZO is compatible with parameter-efficient fine-tuning techniques such as LoRA and prefix tuning, which reduce the number of parameters updated during fine-tuning and can further improve memory efficiency without sacrificing performance. The researchers found that combining MeZO with prefix tuning achieved the best results on some downstream tasks.
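As a rough illustration of how such a combination could look with the sketch above: if a LoRA or prefix-tuning wrapper marks only its added parameters as trainable and freezes the rest, the same zeroth-order step can be restricted to that subset. Here `model`, `compute_loss`, and `batch` are hypothetical placeholders.

```python
# Hypothetical usage: perturb and update only the parameter-efficient subset.
trainable = [p for p in model.parameters() if p.requires_grad]
mezo_step(trainable, closure=lambda: compute_loss(model, batch))
```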

MeZO can directly optimize non-differentiable objectives, because it only needs the value of the objective rather than its gradient. This makes it possible to fine-tune against metrics such as accuracy or F1 score, which are commonly used for classification but cannot be targeted by gradient-based training.
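Concretely, the closure handed to a zeroth-order step only has to return a number, so a non-differentiable metric can stand in for a differentiable loss. A hypothetical sketch, negating accuracy so that minimizing it maximizes accuracy (`model`, `inputs`, and `labels` are placeholders):

```python
@torch.no_grad()
def negative_accuracy(model, inputs, labels):
    # Argmax-based accuracy is non-differentiable, but the zeroth-order
    # update only needs its scalar value, never its gradient.
    preds = model(inputs).argmax(dim=-1)
    return -(preds == labels).float().mean().item()

# e.g. mezo_step(trainable, closure=lambda: negative_accuracy(model, x, y))
```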

The researchers also provide theoretical insight into why MeZO can effectively fine-tune huge models, despite classical analyses suggesting that zeroth-order methods should slow down in proportion to the number of parameters. They argue that adequate pre-training gives these models a well-behaved loss landscape whose effective dimension is far smaller than the raw parameter count, and it is this effective dimension that governs MeZO's convergence.

MeZO provides a memory-efficient approach for fine-tuning extremely large language models, expanding what is possible on current hardware. The researchers hope it enables fine-tuning ever larger and more powerful models in the future.

Read More:

Fine-Tuning Language Models with Just Forward Passes
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophi…
