rasgaard
7 followers · 36 following
AI & ML interests
None yet
Recent Activity
upvoted an article 8 days ago
From Llasa to Llasagna: Finetuning LLaSA to generates Italian speech and other languages
liked a model 29 days ago
hexgrad/Kokoro-82M
reacted to davanstrien's post about 1 month ago
Introducing scandi-fine-web-cleaner https://huggingface.co./davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative. Today, I'm happy to share the first classifier trained on this data.

What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co./datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
Organizations
Papers (1)
arxiv:2305.17154
Models (12)
rasgaard/luke-base-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 120
rasgaard/luke-base-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 143
rasgaard/squeezebert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 200
rasgaard/squeezebert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 182
rasgaard/distilbert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 200
rasgaard/distilbert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 221
rasgaard/bert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 163
rasgaard/bert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 199
rasgaard/roberta-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 244
rasgaard/roberta-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 214
Datasets (3)
rasgaard/mmi-bendr-preprocessed • Updated Feb 19, 2024 • 4.41k • 48
rasgaard/20_newsgroups • Updated Sep 13, 2023 • 18.8k • 286
rasgaard/FTRACE-Synth • Updated Feb 20, 2023 • 3.2M • 35