Fine-Tuning Large Language Models Without Backpropagation: A Memory-Efficient Approach
Large language models with billions of parameters demand enormous amounts of memory and compute when they are fine-tuned with backpropagation, which makes adapting ever-larger models increasingly difficult.
The researchers propose a new approach called MeZO (memory-efficient zeroth-order optimizer) that can fine-tune large language models using the same memory footprint as inference while achieving comparable performance to backpropagation fine-tuning.
MeZO is a zeroth-order optimizer: it estimates gradients using only forward passes and updates the model parameters in place, so it never has to store gradients or the activations a backward pass would require. This lets MeZO fine-tune extremely large models that cannot fit in memory under backpropagation-based fine-tuning.
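To make the mechanism concrete, here is a minimal, hedged sketch of a MeZO-style update step in PyTorch. It illustrates the general idea (perturb the weights with random noise, compare two forward passes, then update along that noise direction, regenerating the noise from a saved seed instead of storing it); it is not the authors' implementation, and the function names (`perturb`, `zo_step`), the `loss_fn(model, batch)` interface, and the hyperparameter values are assumptions made for the example.

```python
import torch


@torch.no_grad()
def perturb(params, scale, seed):
    """Add scale * z to every trainable parameter, regenerating z from `seed`."""
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.add_(scale * z)


@torch.no_grad()
def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One zeroth-order update: two forward passes, no backward pass.

    `loss_fn(model, batch)` is assumed to return a scalar loss value.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    seed = int(torch.randint(0, 2**31 - 1, (1,)).item())

    perturb(params, +eps, seed)        # theta -> theta + eps * z
    loss_plus = loss_fn(model, batch)
    perturb(params, -2 * eps, seed)    # theta -> theta - eps * z
    loss_minus = loss_fn(model, batch)
    perturb(params, +eps, seed)        # restore the original theta

    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    # Update in place, regenerating the same z from the seed rather than storing it.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.sub_(lr * projected_grad * z)

    return loss_plus
```

Only a random seed and a scalar finite-difference value need to be remembered between the forward passes and the update, which is why the memory footprint stays close to that of inference.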
With a single NVIDIA A100 80GB GPU, MeZO can fine-tune a 30-billion-parameter model, whereas backpropagation-based fine-tuning fits at most a 2.7-billion-parameter model.
Despite using up to 12x less memory, MeZO comes within 1% of backpropagation fine-tuning on 7 of the 11 tasks evaluated with a 13-billion-parameter language model.
MeZO also significantly outperforms memory-efficient alternatives such as in-context learning and linear probing. Those methods either leave the model's weights untouched or train only a small layer on top, so they cannot make full use of the model's parameters; MeZO delivers full-parameter fine-tuning within the same memory constraints.
MeZO is compatible with parameter-efficient fine-tuning techniques like LoRA and prefix tuning. This reduces the number of parameters updated during fine-tuning, which can further improve memory efficiency without sacrificing performance. The researchers found that combining MeZO with prefix tuning achieved the best results on some downstream tasks.
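As a hedged sketch of how the two could be combined (assuming adapter weights whose names contain a `lora_` prefix, as in common LoRA implementations), one would freeze the base model, leave only the adapter parameters trainable, and then run the same zeroth-order step from the earlier sketch, which already skips frozen parameters:

```python
# Freeze the base model and train only adapter parameters with the zo_step
# sketch above; the "lora_" name filter is an assumption for illustration.
for name, p in model.named_parameters():
    p.requires_grad = "lora_" in name

loss = zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3)
```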
MeZO can also optimize non-differentiable objectives directly, since it only ever needs the value of the objective, never its gradient. This makes it possible to fine-tune for metrics such as accuracy or F1 score, which are commonly used to evaluate classification and other tasks but cannot be backpropagated through.
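For example, a non-differentiable objective can be dropped straight into the earlier `zo_step` sketch as the `loss_fn`. The snippet below is a hypothetical illustration that scores a batch by 1 minus accuracy, assuming a HuggingFace-style causal language model whose output exposes `.logits` and a batch dict with `input_ids` and single-token `labels`:

```python
import torch


@torch.no_grad()
def one_minus_accuracy(model, batch):
    """Non-differentiable objective: 1 - accuracy of greedy next-token predictions."""
    logits = model(batch["input_ids"]).logits        # (batch, seq_len, vocab)
    preds = logits[:, -1, :].argmax(dim=-1)          # greedy prediction at the last position
    accuracy = (preds == batch["labels"]).float().mean().item()
    return 1.0 - accuracy


# loss = zo_step(model, one_minus_accuracy, batch)
```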
The researchers also provide theoretical insight into why MeZO can fine-tune huge models effectively, even though classical analyses suggest zeroth-order methods should slow down drastically as the number of parameters grows. They argue that adequate pre-training, combined with task prompts, gives the fine-tuning loss landscape a small effective dimensionality, so MeZO's convergence is governed by that much smaller effective rank rather than by the billions of raw parameters.
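For reference, the two forward passes in the sketch above implement the classical simultaneous-perturbation (SPSA) gradient estimate,

$$
\hat{\nabla}\mathcal{L}(\theta) \;=\; \frac{\mathcal{L}(\theta + \epsilon z) - \mathcal{L}(\theta - \epsilon z)}{2\epsilon}\, z,
\qquad z \sim \mathcal{N}(0, I_d),
$$

which, in the limit $\epsilon \to 0$, has expectation equal to the true gradient (since $\mathbb{E}[z z^\top] = I_d$) but a noise level that classical analyses tie to the full parameter dimension $d$; the paper's argument is that for prompted, pre-trained models the quantity that matters is the much smaller effective dimension of the loss landscape, not $d$ itself.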
MeZO offers a memory-efficient way to fine-tune extremely large language models, expanding what is possible on current hardware. The researchers hope the approach will enable adapting ever larger and more powerful models in the future.