Meta AI, the AI research division of Meta (formerly Facebook), recently released LLaMA, a collection of language models ranging in size from 7 billion to 65 billion parameters. The models were trained exclusively on publicly available datasets, without any proprietary or inaccessible data. The largest model, LLaMA-65B, is competitive with top-performing models such as Chinchilla-70B and PaLM-540B.

Scaling laws for language models aim to determine how best to scale the dataset and model size for a given training compute budget. This objective, however, ignores the inference budget, which becomes critical when serving a language model at scale. Given a target level of performance, the preferred model is not the one that is fastest to train, but the one that is fastest at inference.


Language models are used for tasks such as language translation, text summarization, and sentiment analysis. They require large amounts of training data, and their performance depends heavily on how the dataset and model sizes are scaled against the available training compute.
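To make "training compute budget" concrete, here is a minimal sketch using the common C ≈ 6·N·D rule of thumb (total FLOPs ≈ 6 × parameters × training tokens). The token counts are the ones reported in the LLaMA paper; the 6·N·D factor is a standard approximation, not Meta's exact accounting.

```python
# Rough training-compute estimate using the C ~= 6 * N * D rule of thumb,
# where N = parameter count and D = number of training tokens.
# Token counts are those reported for LLaMA; 6*N*D is an approximation.

LLAMA_CONFIGS = {
    "LLaMA-7B":  {"params": 7e9,  "tokens": 1.0e12},
    "LLaMA-13B": {"params": 13e9, "tokens": 1.0e12},
    "LLaMA-33B": {"params": 33e9, "tokens": 1.4e12},
    "LLaMA-65B": {"params": 65e9, "tokens": 1.4e12},
}

def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6 * params * tokens

for name, cfg in LLAMA_CONFIGS.items():
    print(f"{name}: ~{approx_train_flops(cfg['params'], cfg['tokens']):.2e} FLOPs")
```

The point of the exercise: for a fixed FLOPs budget you can trade parameters for tokens, and LLaMA deliberately spends the budget on more tokens and fewer parameters.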

Meta's LLaMA models were designed to achieve the best possible performance at various inference budgets. Inference is the process of applying a trained model to new input to produce an output; in other words, it is what happens when someone types a prompt into a language model and gets a response. For a language model to be useful at scale, it needs to be fast at inference.
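For a concrete picture of what inference looks like in practice, here is a minimal sketch using the Hugging Face transformers API. The checkpoint path is a placeholder, since the official LLaMA weights are distributed by Meta under a research license rather than as a public hub ID.

```python
# Minimal inference sketch with Hugging Face transformers.
# "path/to/llama-7b" is a placeholder: substitute a local directory containing
# LLaMA weights converted to the transformers format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-7b"  # placeholder, not an official hub ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```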

To achieve better performance at inference, Meta trained the LLaMA models on more tokens than is typical for models of their size. Tokens are the words or sub-word pieces a model reads, and more tokens mean more data to train on. By training on more tokens, Meta was able to create smaller models that still perform well at inference: the largest LLaMA model has only 65 billion parameters, far fewer than GPT-3 (175B) or PaLM-540B.
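To make "tokens" concrete: LLaMA tokenizes text with a byte-pair-encoding tokenizer built on SentencePiece, so a sentence is split into sub-word pieces rather than whole words. The sketch below assumes you have the model's tokenizer.model file locally; the pieces shown in the comment are illustrative, not the actual LLaMA vocabulary output.

```python
# Tokenization sketch with SentencePiece (LLaMA's tokenizer is a BPE model
# trained with SentencePiece). "tokenizer.model" is assumed to be the tokenizer
# file distributed alongside the model weights.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "Language models predict the next token."
pieces = sp.encode(text, out_type=str)   # sub-word strings
ids = sp.encode(text, out_type=int)      # corresponding vocabulary ids

print(pieces)  # e.g. ['▁Language', '▁models', '▁predict', '▁the', '▁next', '▁token', '.']
print(len(ids), "tokens")
```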


The result is a family of models that are small enough to be practical: LLaMA-13B can run on a single GPU, making these models accessible to a much wider audience. Despite its smaller size, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.
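As a back-of-the-envelope check on the "single GPU" claim, the sketch below estimates raw weight memory as parameters × bytes per parameter at a few precisions. It ignores activations, the KV cache, and framework overhead, so treat the numbers as lower bounds.

```python
# Back-of-the-envelope weight-memory estimate: params * bytes_per_param.
# Ignores activations, KV cache, and framework overhead (lower bound only).

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"LLaMA-13B @ {precision}: ~{weight_memory_gb(13e9, precision):.1f} GB")
# fp16 -> ~26 GB, int8 -> ~13 GB: within reach of a single high-memory GPU.
```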

Unlike other existing models, such as Chinchilla and PaLM, which rely on data that is proprietary or undocumented, the LLaMA models use only publicly available data, making them compatible with open-sourcing. In this blog, we will explore the modifications Meta made to the transformer architecture, the training method, and how the models perform compared with other language models. Finally, we will discuss the biases and toxicity encoded in the models, as uncovered by recent responsible-AI benchmarks.

LLaMA: Open and Efficient Foundation Language Models - Meta Research
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to...
