NLP models are now used in almost all industries due to their effectiveness. However, they require substantial computation to train and run. In the paper below, the authors propose a setting called cramming and examine training setups that keep the model's total parameter count fixed while reducing the cost of each gradient update, guided by the scaling laws observed for large transformers. By enumerating a small number of interesting features of the transformer training design space, they demonstrate that cramming can deliver results that are sometimes comparable to larger models requiring far more computation, in certain settings and on particular datasets.
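To make that budget framing concrete, here is a tiny back-of-the-envelope sketch (my own illustration, not from the paper): with a fixed one-day compute budget, any reduction in the cost of a single gradient update translates directly into more updates, which is why changes that speed up each step without shrinking the model are attractive. All timing numbers in the snippet are hypothetical.

```python
# Hypothetical back-of-the-envelope sketch of the "cramming" budget trade-off:
# with a fixed one-day, single-GPU budget, the number of gradient updates we
# can afford falls as the per-update cost rises, so speeding up each update
# (without shrinking the model) buys more training steps.
# All numbers below are illustrative assumptions, not values from the paper.

BUDGET_SECONDS = 24 * 60 * 60          # one day of wall-clock time

def affordable_steps(seconds_per_update: float) -> int:
    """Gradient updates that fit into the fixed budget."""
    return int(BUDGET_SECONDS / seconds_per_update)

# A baseline update vs. a cheaper update for a model of the same size
# (e.g., after removing components that slow training but barely help).
baseline = affordable_steps(seconds_per_update=0.40)
crammed = affordable_steps(seconds_per_update=0.25)

print(f"baseline updates in one day: {baseline:,}")
print(f"crammed updates in one day: {crammed:,} ({crammed / baseline:.2f}x more)")
```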

Abstract:

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day?

We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
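As a rough illustration of the setting described in the abstract, the sketch below trains a toy BERT-style masked language model until a wall-clock budget expires. It is a minimal stand-in, not the authors' pipeline: the model size, random token data, and 60-second budget are assumptions chosen only to keep the example self-contained and runnable.

```python
# Minimal sketch (not the authors' code) of the core setting: pretrain a small
# BERT-style encoder with masked language modeling under a hard wall-clock
# budget. Model, data, and budget below are illustrative assumptions.
import time
import torch
import torch.nn as nn

VOCAB, MASK_ID, D_MODEL, SEQ_LEN = 1000, 0, 128, 64
BUDGET_SECONDS = 60  # stand-in for the paper's 24-hour budget

class TinyMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

model = TinyMLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

start, steps = time.time(), 0
while time.time() - start < BUDGET_SECONDS:        # train until the budget runs out
    tokens = torch.randint(1, VOCAB, (32, SEQ_LEN))  # stand-in corpus batch
    mask = torch.rand(tokens.shape) < 0.15           # mask ~15% of positions
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs)
    loss = loss_fn(logits[mask], tokens[mask])       # predict only masked tokens
    opt.zero_grad(); loss.backward(); opt.step()
    steps += 1

print(f"completed {steps} gradient updates within the budget")
```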

Paper Link:

Cramming: Training a Language Model on a Single GPU in One Day

Github:

GitHub - JonasGeiping/cramming: Cramming the training of a (BERT-type) language model into limited compute.

Review:

Cramming: Training a language model on a single GPU in one day
What happens when transformer-based language model pretraining is crammed into less compute?
