The field of artificial intelligence (AI) has witnessed remarkable progress in recent years, driven by the development of powerful foundation models such as GPT-4. However, these models are often closed behind commercial APIs, limiting their use with sensitive data and restricting research and customization.

The RedPajama project aims to address this limitation by creating fully open-source models that can rival the quality of closed models. RedPajama is a collaboration between Together, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute.

RedPajama has three key components: pre-training data, base models, and instruction-tuning data and models. Today, the project announced the completion of the first component, the pre-training data. This dataset, a reproduction of the LLaMA training data, consists of over 1.2 trillion tokens and offers both high quality and broad coverage.

The availability of this pre-training data is a significant milestone in the RedPajama project. It will enable researchers and developers to create their own open-source language models, which can be customized and trained for specific use cases. By making these models fully open-source, RedPajama can remove barriers to innovation and democratize access to AI.

The RedPajama project is part of a broader movement towards open-source AI, which has the potential to revolutionize the field. Open-source models can lead to incredible creativity and innovation from broad participation by communities around the world. In many ways, AI is having its Linux moment, and RedPajama is at the forefront of this movement.

We are excited to see what the future holds for the RedPajama project and the wider open-source AI community. It is an exciting time to be involved in AI, and we look forward to seeing the incredible advancements that will be made possible by open-source models like those developed by RedPajama.

RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens — TOGETHER
RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.
