Daniel van Strien (davanstrien)

AI & ML interests

Machine Learning Librarian

Recent Activity

updated a dataset 7 minutes ago
librarian-bots/model_cards_with_metadata
updated a dataset 9 minutes ago
davanstrien/grpo-completions
published a dataset 14 minutes ago
davanstrien/grpo-completions

Organizations

Hugging Face, Notebooks-explorers, Living with Machines, BigScience Workshop, Spaces-explorers, BigScience Catalogue Data, Hacks/Hackers, flyswot, BigScience: LMs for Historical Texts, Cohere For AI, Webhooks Explorers (BETA), HuggingFaceM4, Open Access AI Collective, HF Canonical Model Maintainers, BigLAM: BigScience Libraries, Archives and Museums, Hugging Face OSS Metrics, ImageIN, Stable Diffusion Bias Eval, Librarian Bots, Blog-explorers, Hacktoberfest 2023, Hugging Face TB Research, geospatial, HPLT, HF-IA-archiving, 2A2I Legacy Models & Datasets, testy, DIBT-for-Klingon, Wikimedia Movement, DIBT-for-Esperanto, Journalists on Hugging Face, PleIAs, Argilla Explorers, Persian AI Community, HuggingFaceFW, Data Is Better Together, Social Post Explorers, OMOTO AI, academic-datasets, HuggingFaceFW-Dev, Hugging Face Discord Community, UCSF-JHU Opioid Industry Documents Archive, Dataset Tools, PDFPages, dibt-private, Data Is Better Together Contributor, Bluesky Community, Open R1

davanstrien's activity

posted an update about 1 hour ago
πŸ“Š Introducing "Hugging Face Dataset Spotlight" πŸ“Š

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
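
If you want to explore the featured datasets yourself, here is a minimal sketch using the datasets library (the split name is an assumption; check each dataset card):

from datasets import load_dataset

# Stream a few examples from one of the featured datasets
# (split name is an assumption; check the dataset card).
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)
for example in ds.take(3):
    print(example)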
reacted to stefan-it's post with πŸ”₯ about 2 hours ago
After running some 3DMark and FurMark benchmarks on Windows to make sure that my new 5090 is not melting its cables [1], and taking some nice shots with a thermal camera (I don't think that's too much), running fine-tuning experiments with my favorite Flair & Transformers libraries is very easy.

Important steps:

It's a good idea to start with a fresh Ubuntu 24.04 installation with the latest CUDA 12.8 and the open NVIDIA driver; see [2] for more advice:

sudo apt -y install cuda-toolkit-12-8 nvidia-open

I tried updating from an existing Ubuntu installation with an older CUDA and driver version, and it resulted in a system that wouldn't boot.

If you are using PyTorch 2.6 built against CUDA 12.6, it will fail with:

NVIDIA Graphics Device with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

But no worries! For PyTorch, you just need a nightly 2.7 version that was built with CUDA 12.8. This can easily be done via:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

After that the latest Flair version can be installed and fine-tuning will work!
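
A quick sanity check after the install (a minimal sketch; device index 0 is assumed for a single-GPU box):

import torch

# Confirm the nightly build sees the new GPU and its sm_120 compute capability.
print(torch.__version__)                    # should report a 2.7 dev build with cu128
print(torch.cuda.is_available())            # True once the driver is set up correctly
print(torch.cuda.get_device_capability(0))  # expect (12, 0) for an RTX 5090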

References:

[1]: https://www.reddit.com/r/nvidia/comments/1inpox7/rtx_50_series_12vhpwr_megathread/
[2]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
posted an update about 21 hours ago
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though πŸ˜…

Here is an example for eth-nlped/stepverify
posted an update 8 days ago
Hacked together a way to log trl GRPO training completions to a πŸ€— dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completions and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'
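
For context, here is a minimal sketch of where those parameters sit, assuming the patched GRPOConfig from the Colab (log_completions is a stock trl flag, but log_completions_hub_repo is the hacky addition and won't exist in vanilla trl):

from trl import GRPOConfig

# Sketch only: log_completions_hub_repo comes from the patched trainer
# in the Colab; vanilla trl's GRPOConfig will not accept it.
config = GRPOConfig(
    output_dir="grpo-demo",
    log_completions=True,
    log_completions_hub_repo="your-username/repo-name",
)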

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g

posted an update 12 days ago
posted an update 14 days ago
How do you make 1M+ Hugging Face models & datasets more discoverable?

davanstrien/Smol-Hub-tldr!

I fine-tuned HuggingFaceTB/SmolLM2-360M to generate one-line summaries from a model or dataset README.

Its own self-description?
"A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

The goal? Make it easier to find the right models and datasets for your specific needs. It's already powering a semantic search for datasets Space.

It's still a WIP, but thanks to @loubnabnl, @anton-l, @eliebak et al. for cooking up such a nice base model for fine-tuning small, efficient models for specific domains and tasks. πŸ™
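
If you want to try it, here is a minimal sketch assuming a standard causal-LM interface (the prompt format is an assumption; check the model card for the real template):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davanstrien/Smol-Hub-tldr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

readme_text = "..."  # paste a model or dataset card README here
inputs = tokenizer(readme_text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))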
posted an update 15 days ago
reacted to Ihor's post with πŸš€ 28 days ago
πŸš€ Reproducing DeepSeek R1 for Text-to-Graph Extraction

I’ve been working on replicating DeepSeek R1, focusing on zero-shot text-to-graph extractionβ€”a challenging task where LMs extract entities and relations from text based on predefined types.

🧠 Key Insight:
Language models struggle when constrained by entity/relation types. Supervised training alone isn’t enough, but reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), shows promise.

πŸ’‘ Why GRPO?
- It trains the model to generate structured graphs, optimizing multiple reward functions (format, JSON validity, and extraction accuracy; an illustrative sketch follows this list).
- It allows the model to learn from both positive and hard negative examples dynamically.
- RL can be fine-tuned to emphasize relation extraction improvements.
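
As an illustration only (not the authors' actual code), a JSON-validity reward for GRPO might look like this minimal Python sketch:

import json

def json_validity_reward(completion: str) -> float:
    # Illustrative reward: 1.0 if the completion parses as JSON, else 0.0.
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0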

πŸ“Š Early Results:
Even with limited training, F1 scores consistently improved, and we saw clear benefits from RL-based optimization. More training = better performance!

πŸ”¬ Next Steps:
We’re scaling up experiments with larger models and high-quality data. Stay tuned for updates! Meanwhile, check out one of our experimental models here:
Ihor/Text2Graph-R1-Qwen2.5-0.5b

πŸ“” Learn more details from the blog post: https://medium.com/p/d8b648d9f419

Feel free to share your thoughts and ask questions!
posted an update about 1 month ago
posted an update about 1 month ago
reacted to fdaudens's post with ❀️ about 1 month ago
Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:

- Original release: 8 models, 540K downloads. Just the beginning...

- The community turned those open-weight models into +550 NEW models on Hugging Face. Total downloads? 2.5Mβ€”nearly 5X the originals.

The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.

When you empower builders, innovation explodes. For everyone. πŸš€

The most popular community model? @bartowski 's DeepSeek-R1-Distill-Qwen-32B-GGUF version β€” 1M downloads alone.
posted an update about 1 month ago
🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
β€’ Japanese
β€’ Italian
β€’ Old High German

Learn more and contribute: https://huggingface.co./blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
reacted to tomaarsen's post with πŸ”₯❀️ about 1 month ago
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
πŸ“œ my training scripts, using the Sentence Transformers library
πŸ“Š my Weights & Biases reports with losses & metrics
πŸ“• my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
πŸ“ No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
πŸ“ Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
πŸͺ† Matryoshka support: allow you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co./blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1
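
Trying them is a one-liner with Sentence Transformers; a minimal sketch (the truncate_dim value is just an example of Matryoshka truncation):

from sentence_transformers import SentenceTransformer

# Load the English retrieval model; truncate_dim applies Matryoshka-style
# truncation of the embeddings (256 dims here is an example value).
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", truncate_dim=256)
embeddings = model.encode(["The quick brown fox jumps over the lazy dog."])
print(embeddings.shape)  # (1, 256)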
reacted to AdinaY's post with πŸ”₯ about 1 month ago
MiniMax, the company behind Hailuo_AI, has joined the open source community by releasing both models and demos of MiniMax-Text-01 & MiniMax-VL-01πŸ”₯
- Models: MiniMaxAI/MiniMax-VL-01, MiniMaxAI/MiniMax-Text-01
- Demos: MiniMaxAI/MiniMax-VL-01, MiniMaxAI/MiniMax-Text-01

✨ MiniMax-Text-01:
- 456B with 45.9B activated per token
- Combines Lightning Attention, Softmax Attention, and MoE for optimal performance
- Training context up to 1M tokens, inference handles 4M tokens

✨ MiniMax-VL-01:
- ViT-MLP-LLM framework (non-transformer πŸ‘€)
- Handles image inputs from 336Γ—336 to 2016Γ—2016
- 694M image-caption pairs + 512B tokens processed across 4 stages
reacted to AdinaY's post with πŸ”₯ about 1 month ago
MiniCPM-o 2.6 πŸ”₯ an on-device multimodal LLM released by OpenBMB from the Chinese community
Model: openbmb/MiniCPM-o-2_6
✨ Real-time English/Chinese conversation, emotion control and ASR/TTS
✨ Real-time video/audio understanding
✨ Processes up to 1.8M pixels, leads OCRBench & supports 30+ languages
reacted to their post with πŸ€— about 2 months ago
posted an update about 2 months ago
Introducing scandi-fine-web-cleaner davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

πŸ” What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C ( data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
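
A minimal sketch of running the classifier with transformers (the label names and score format are assumptions; see the model card):

from transformers import pipeline

# Sketch only: output labels are assumptions; check the model card.
classifier = pipeline("text-classification", model="davanstrien/scandi-fine-web-cleaner")
print(classifier("En kort dansk tekst af god kvalitet."))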
posted an update about 2 months ago
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have reached 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data ( HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pre-training.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages outside of English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
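
If you want to explore the annotations yourself, here is a minimal sketch (the per-language config name is an assumption; see the dataset page for the exact list):

from datasets import load_dataset

# Config names follow the FineWeb2 language codes (an assumption;
# check data-is-better-together/fineweb-c for the exact names).
ds = load_dataset("data-is-better-together/fineweb-c", "dan_Latn", split="train")
print(ds[0])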

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co./blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c