The AI community is buzzing with excitement as Together unveils the latest iteration of the RedPajama dataset. This colossal dataset, now in its second version, boasts 30 trillion filtered and deduplicated tokens drawn from a staggering 84 CommonCrawl dumps, spanning five languages.

It's not just the size that's impressive; it's the addition of over 40 pre-computed data quality annotations that sets this release apart, offering unprecedented resources for large language model (LLM) training.

Why RedPajama Data V2 Matters

The success of state-of-the-art open LLMs hinges on the quality and quantity of training data. RedPajama Data V2 is a testament to this, providing a comprehensive pool of web data that can be refined and utilized for crafting high-quality datasets. This release is a significant step up from its predecessor, RedPajama-1T, which saw over 190,000 downloads, demonstrating the community's eagerness for robust training data.

Simplifying LLM Development

One of the most daunting tasks for LLM developers is processing and filtering raw data from sources like CommonCrawl. This process is not only laborious but also requires significant computational resources. RedPajama Data V2 aims to alleviate this burden by offering a base dataset that's already been processed with quality annotations, allowing developers to focus on creating and refining their models.

A Closer Look at the Dataset

RedPajama Data V2 encompasses:

  • Over 100 billion text documents, with more than 100 trillion raw tokens.
  • Quality annotations for a deduplicated subset of 30 trillion tokens.
  • Coverage of five languages: English, French, Spanish, German, and Italian.
  • Open-source data processing scripts available on GitHub and data accessible on HuggingFace.
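Because the data is hosted on HuggingFace, a small slice can be streamed without downloading the multi-terabyte corpus. The sketch below assumes the repo id `togethercomputer/RedPajama-Data-V2` and the `snapshots`/`languages`/`name` loading arguments described in the release notes; check the dataset card before relying on these exact names.

```python
# Sketch: streaming a few documents from RedPajama-Data-V2 on HuggingFace.
# Repo id and config/kwarg names are assumptions from the release notes.
from itertools import islice


def load_kwargs(snapshot: str, language: str) -> dict:
    """Build arguments for datasets.load_dataset (hypothetical config names)."""
    return {
        "path": "togethercomputer/RedPajama-Data-V2",
        "name": "default",
        "snapshots": [snapshot],     # one CommonCrawl dump, e.g. "2023-14"
        "languages": [language],     # one of: en, fr, es, de, it
        "streaming": True,           # stream instead of downloading everything
    }


def first_documents(n: int = 3) -> list:
    """Pull the first n streamed documents (requires `pip install datasets`)."""
    from datasets import load_dataset

    ds = load_dataset(**load_kwargs("2023-14", "en"))
    return list(islice(ds["train"], n))
```

Streaming mode is the practical default here: the full corpus is far too large to materialize locally, and most experiments only need a sample.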

How to Utilize RedPajama Data V2

The dataset is designed to be flexible, allowing developers to apply various filtering rules easily. For instance, implementing Gopher rules or the filtering rules used in RedPajama-v1 or C4 is straightforward, thanks to the pre-computed quality annotations.
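To make this concrete, here is a minimal sketch of Gopher-style filtering driven by pre-computed annotations. The signal names follow the `rps_doc_*` naming scheme mentioned in the release, and the specific thresholds are illustrative assumptions, not the authoritative Gopher values; it also assumes doc-level scalar signals for simplicity.

```python
# Sketch: applying Gopher-inspired rules to pre-computed quality signals.
# Signal names (rps_doc_*) and thresholds are illustrative assumptions.

def passes_gopher_style_rules(signals: dict) -> bool:
    """Return True if a document's quality signals clear a few sample thresholds."""
    word_count = signals.get("rps_doc_word_count", 0)
    mean_word_len = signals.get("rps_doc_mean_word_length", 0.0)
    symbol_ratio = signals.get("rps_doc_symbol_to_word_ratio", 1.0)

    if not (50 <= word_count <= 100_000):      # discard very short/long docs
        return False
    if not (3.0 <= mean_word_len <= 10.0):     # discard gibberish word lengths
        return False
    if symbol_ratio > 0.1:                     # discard symbol-heavy text
        return False
    return True


doc = {
    "rps_doc_word_count": 420,
    "rps_doc_mean_word_length": 4.7,
    "rps_doc_symbol_to_word_ratio": 0.01,
}
print(passes_gopher_style_rules(doc))  # True
```

The point of the pre-computed annotations is exactly this: the expensive per-document statistics are already there, so swapping in C4-style or RedPajama-v1-style rules is just a matter of changing which signals and thresholds the predicate reads.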

A Living Project

RedPajama Data V2 is not static; it's a living project that will evolve with the community's input. The team behind it envisions continuous growth and enrichment of the dataset, incorporating more domains, snapshots, and quality signals over time.

Data Processing and Structure

The dataset is built from CommonCrawl data, processed through the CCNet pipeline to maintain as much raw information as possible. It includes exact deduplication with a Bloom filter and a detailed structure that aligns with CCNet, ensuring a comprehensive and usable dataset.
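For readers unfamiliar with the technique, a Bloom filter answers "have I seen this item before?" in constant memory, at the cost of a small false-positive rate, which is what makes exact deduplication tractable at the scale of 100 billion documents. The toy implementation below illustrates the idea only; the actual pipeline uses tuned sizes and hash counts, and its parameters are not reproduced here.

```python
# Sketch: a minimal Bloom filter for duplicate detection (illustrative only;
# production deduplication uses carefully tuned bit sizes and hash counts).
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a rare false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
bf.add("the quick brown fox")
print("the quick brown fox" in bf)  # True
print("a new document" in bf)       # False (with very high probability)
```

A false positive here means an occasional unique document is wrongly dropped as a duplicate, which is an acceptable trade for never letting a true duplicate through and never storing the documents themselves.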

Empowering the Community

With RedPajama Data V2, Together is not only advancing its own model-building capabilities but also empowering the broader AI community. They encourage feedback and collaboration to further enrich the dataset and support the development of open LLMs.

Read More:

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI
Releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.

RedPajama Data V2 is more than just a dataset; it's a catalyst for innovation in the field of AI language models. By providing a rich, pre-processed, and easily accessible data resource, it promises to accelerate the development of more sophisticated and nuanced LLMs, paving the way for advancements that we can only begin to imagine.

We research, curate, and publish daily updates from the field of AI. A paid subscription gives you access to subscriber-only articles, a platform to build your own generative AI tools, invitations to closed events, and open-source tools.
Consider becoming a paying subscriber to get the latest!