FineWeb2-C: Help Build Better Language Models in Your Language
tl;dr We're developing educational quality classifiers to help create better open LLMs in more languages. Ready to contribute? Start annotating here. Want to learn more? Read on.
Why Dataset Quality Matters
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. A pretraining dataset consists of massive amounts of text that help the model develop its fundamental language capabilities – a vital component for training a strong LLM in any language.
Current Filtering Approaches
Recently, many projects have found that applying various quality filters to pretraining datasets improves the performance of the downstream models trained on the filtered text. These filters include:
- Applying URL filtering using a blocklist to remove adult content and low-quality web pages
- Rule-based filters which remove very repetitive or machine-generated text patterns
- Language filters to ensure texts match the target language and remove mixed-language content
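The filters above can be sketched roughly as follows. This is a simplified illustration, not the actual FineWeb pipeline: the blocklist entries, thresholds, and the ASCII-based language check are all assumptions (a real pipeline would use a language-ID model such as fastText).

```python
# Illustrative sketch of common pretraining-data filters.
# Blocklist entries, thresholds, and the language check are
# simplified assumptions, not the actual FineWeb pipeline.
from urllib.parse import urlparse

BLOCKLIST = {"example-adult-site.com", "spammy-seo-farm.net"}  # hypothetical entries

def url_filter(url: str) -> bool:
    """Reject documents whose domain appears on the blocklist."""
    return urlparse(url).netloc not in BLOCKLIST

def repetition_filter(text: str, max_dup_ratio: float = 0.3) -> bool:
    """Reject very repetitive text: too many duplicated lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    dup_ratio = 1 - len(set(lines)) / len(lines)
    return dup_ratio <= max_dup_ratio

def language_filter(text: str, target: str = "en") -> bool:
    """Placeholder language check; real pipelines use a trained
    language-ID model here, not an ASCII heuristic."""
    ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
    return ascii_ratio > 0.9 if target == "en" else True

def keep_document(url: str, text: str) -> bool:
    """A document survives only if it passes every filter."""
    return url_filter(url) and repetition_filter(text) and language_filter(text)
```

In practice these cheap rule-based checks run first, so that more expensive model-based filtering only sees documents that have already passed them.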
Refining by Educational Quality?
Recently, the authors of FineWeb demonstrated that filtering a pretraining dataset for high educational quality improved the resulting downstream models. This was done using a classifier trained on data labelled synthetically by Llama-3-70B-Instruct.
Why do we need help annotating?
This approach works well for English but may not transfer to other languages. This is where you can help build better datasets and models for your language. The FineWeb2-C initiative aims to create large, high-quality datasets for pretraining language models in many languages. We're doing this by building educational-quality classifiers through a community-driven effort to rate the quality of texts in many languages. These datasets can also serve other purposes, such as providing high-quality reference data in each language, benchmarking, and improving models' (synthetic) annotation capabilities.
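Community ratings like these are typically aggregated into one training label per text before a classifier is trained on them. A minimal sketch of that step is below; the record layout, the integer quality scores, and the majority-vote rule are illustrative assumptions, not the actual FineWeb2-C aggregation logic:

```python
# Hypothetical sketch: turning per-annotator ratings into training
# labels by majority vote. The record fields ("text_id", "score")
# and the voting rule are assumptions for illustration.
from collections import Counter

def aggregate_annotations(annotations: list[dict]) -> list[dict]:
    """Group ratings by text and keep each text's most common score."""
    by_text: dict[str, list[int]] = {}
    for ann in annotations:
        by_text.setdefault(ann["text_id"], []).append(ann["score"])
    labelled = []
    for text_id, scores in by_text.items():
        score, count = Counter(scores).most_common(1)[0]
        labelled.append({
            "text_id": text_id,
            "label": score,
            # fraction of annotators who agree with the winning label
            "agreement": count / len(scores),
        })
    return labelled
```

The agreement fraction is useful on its own: low-agreement texts can be flagged for extra annotation rather than fed straight into classifier training.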
What has been done so far?
After around two weeks, the community has already made a significant impact on this effort. We've released the first version of the dataset, covering the 12 languages that have reached the 1,000-annotation threshold. So far we've seen:
- 34,571 total annotations submitted
- 95 languages with annotations
- 321 total contributors
You can find a full leaderboard of languages and contributors in our leaderboard Space.
We believe that open-source AI can be more inclusive and do amazing things when the community works together 🤗
How to Start Annotating
- Create a Hugging Face Account (if you don't have one)
- Visit our Argilla Space and log in with your Hugging Face account
- Select the language you'd like to annotate
- Read the annotation guidelines carefully before starting
- Start Annotating!
Spread the Word!
Beyond annotating, you can also help ensure we reach all language communities by spreading the word. Need help? Join our community discussion.