Dataset Tools

community

AI & ML interests

Tools for creating and exploring datasets

Recent Activity

Dataset-Tools's activity

fdaudensΒ 
posted an update about 3 hours ago
view post
Post
124
What if AI becomes as ubiquitous as the internet, but runs locally and transparently on our devices?

Fascinating TED talk by @thomwolf on open source AI and its future impact.

Imagine this for AI: instead of black box models running in distant data centers, we get transparent AI that runs locally on our phones and laptops, often without needing internet access. If the original team moves on? No problem - resilience is one of the beauties of open source. Anyone (companies, collectives, or individuals) can adapt and fix these models.

This is a compelling vision of AI's future that solves many of today's concerns around AI transparency and centralized control.

Watch the full talk here: https://www.ted.com/talks/thomas_wolf_what_if_ai_just_works
davanstrienΒ 
posted an update about 5 hours ago
view post
Post
132
πŸ“Š Introducing "Hugging Face Dataset Spotlight" πŸ“Š

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
davanstrienΒ 
posted an update 1 day ago
view post
Post
1744
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though πŸ˜…

Here is an example for eth-nlped/stepverify
  • 1 reply
Β·
fdaudensΒ 
posted an update 1 day ago
view post
Post
2430
Is this the best tool to extract clean info from PDFs, handwriting and complex documents yet?

Open source olmOCR just dropped and the results are impressive.

Tested the free demo with various documents, including a handwritten Claes Oldenburg letter. The speed is impressive: 3000 tokens/second on your own GPU - that's 1/32 the cost of GPT-4o ($190/million pages). Game-changer for content extraction and digital archives.

To achieve this, Ai2 trained a 7B vision language model on 260K pages from 100K PDFs using "document anchoring" - combining PDF metadata with page images.

Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human eval results back this up.

πŸ‘‰ Try the demo: https://olmocr.allenai.org

Going right into the AI toolkit: JournalistsonHF/ai-toolkit
  • 3 replies
Β·
prithivMLmodsΒ 
posted an update 3 days ago
view post
Post
3518
Dropping some of the custom fine-tunes based on SigLIP2,
with a single-label classification problem type! πŸŒ€πŸ§€

- AI vs Deepfake vs Real : prithivMLmods/AI-vs-Deepfake-vs-Real-Siglip2
- Deepfake Detect : prithivMLmods/Deepfake-Detect-Siglip2
- Fire Detection : prithivMLmods/Fire-Detection-Siglip2
- Deepfake Quality Assess : prithivMLmods/Deepfake-Quality-Assess-Siglip2
- Guard Against Unsafe Content : prithivMLmods/Guard-Against-Unsafe-Content-Siglip2

🌠Collection : prithivMLmods/siglip2-custom-67bcdb2de8fe96b99fb4e19e
fdaudensΒ 
posted an update 4 days ago
view post
Post
3162
πŸš€ Just launched: A toolkit of 20 powerful AI tools that journalists can use right now - transcribe, analyze, create. 100% free & open-source.

Been testing all these tools myself and created a searchable collection of the most practical ones - from audio transcription to image generation to document analysis. No coding needed, no expensive subscriptions.

Some highlights I've tested personally:
- Private, on-device transcription with speaker ID in 100+ languages using Whisper
- Website scraping that just works - paste a URL, get structured data
- Local image editing with tools like Finegrain (impressive results)
- Document chat using Qwen 2.5 72B (handles technical papers well)

Sharing this early because the best tools come from the community. Drop your favorite tools in the comments or join the discussion on what to add next!

πŸ‘‰ JournalistsonHF/ai-toolkit
prithivMLmodsΒ 
posted an update 6 days ago
view post
Post
5748
It's really interesting about the deployment of a new state of matter in Majorana 1: the world’s first quantum processor powered by topological qubits. If you missed this news this week, here are some links for you:

πŸ…±οΈTopological qubit arrays: https://arxiv.org/pdf/2502.12252

βš›οΈ Quantum Blog: https://azure.microsoft.com/en-us/blog/quantum/2025/02/19/microsoft-unveils-majorana-1-the-worlds-first-quantum-processor-powered-by-topological-qubits/

πŸ“– Read the story: https://news.microsoft.com/source/features/innovation/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing/

πŸ“ Majorana 1 Intro: https://youtu.be/Q4xCR20Dh1E?si=Z51DbEYnZFp_88Xp

πŸŒ€The Path to a Million Qubits: https://youtu.be/wSHmygPQukQ?si=TS80EhI62oWiMSHK
Β·
fdaudensΒ 
posted an update 7 days ago
davanstrienΒ 
posted an update 8 days ago
view post
Post
2521
Hacked together a way to log trl GRPO training completions to a πŸ€— dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completion and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g

alielfilali01Β 
posted an update 8 days ago
view post
Post
674
🚨 Arabic LLM Evaluation 🚨

Few models join the ranking of inceptionai/AraGen-Leaderboard Today.

The new MistralAI model, Saba, is quite impressive, Top10 ! Well done @arthurmensch and team.

Sadly Mistral did not follow its strategy about public weights this time, we hope this changes soon and we get the model with a permissive license.

We added other Mistral models and apparently, we have been sleeping on mistralai/Mistral-Large-Instruct-2411 !

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait finally ALLaM is here and it is IMPRESSIVE given its size !

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.
fdaudensΒ 
posted an update 10 days ago
prithivMLmodsΒ 
posted an update 10 days ago
view post
Post
3885
Dino: The Minimalist Multipurpose Chat System 🌠
Agent-Dino : prithivMLmods/Agent-Dino
Github: https://github.com/PRITHIVSAKTHIUR/Agent-Dino

By default, it performs the following tasks:
{Text-to-Text Generation}, {Image-Text-Text Generation}
@image: Generates an image using Stable Diffusion xL.
@3d: Generates a 3D mesh.
@web: Web search agents.
@rAgent: Initiates a reasoning chain using Llama mode for coding explanations.
@tts1-♀, @tts2-β™‚: Voice generation (Female and Male voices).
@yolo : Object Detection
fdaudensΒ 
posted an update 12 days ago
view post
Post
2268
Will we soon all have our own personalized AI news agents? And what does it mean for journalism?

Just built a simple prototype based on the Hugging Face course. It lets you get customized news updates on any topic.

Not perfect yet, but you can see where things could go: we'll all be able to build personalized AI agents that curate & analyze news for each of us. And users who could decide to build custom news products for their needs, such as truly personalized newsletters or podcasts.

The implications for both readers & news organizations are significant. To name a few:
- Will news articles remain the best format for informing people?
- What monetization model will work for news organizations?
- How do you create an effective conversion funnel?

πŸ‘‰ Try it here: fdaudens/my-news-agent (Code is open-source)
πŸ‘‰ Check out the course: https://huggingface.co./learn/agents-course/unit0/introduction
prithivMLmodsΒ 
posted an update 12 days ago
view post
Post
4471
The last week of Impression Craft Arts and sketches from strangerzonehfπŸŽ¨πŸ§‘πŸ»β€πŸŽ¨

- Collection : strangerzonehf/Flux-Ultimate-LoRA-Collection

Adapters:
+ Ld-Art : strangerzonehf/Ld-Art
+ Animeopix-Flux : strangerzonehf/Animeopix-Flux
+ Flux-Super-Paint-LoRA : strangerzonehf/Flux-Super-Paint-LoRA
+ CinematicShot-Pics-Flux : strangerzonehf/cinematicShot-Pics-Flux
+ Oil-Wall-Art-Flux : strangerzonehf/Oil-Wall-Art-Flux
+ Pixelo-Flux : strangerzonehf/Pixelo-Flux
+ Abstract-Shattered : strangerzonehf/Abstract-Shattered
+ Neon-Impressionism-Flux : strangerzonehf/Neon-Impressionism-Flux
+ NewG-Art : strangerzonehf/NewG-Art

πŸͺ§Demo : prithivMLmods/FLUX-LoRA-DLC
πŸ€—Page : https://huggingface.co./strangerzonehf
louisbrulenaudetΒ 
posted an update 12 days ago
view post
Post
3051
I am pleased to introduce my first project built upon Hugging Face’s smolagents framework, integrated with Alpaca for financial market analysis automation πŸ¦™πŸ€—

The project implements technical indicators such as the Relative Strength Index (RSI) and Bollinger Bands to provide momentum and volatility analysis. Market data is retrieved through the Alpaca API, enabling access to historical price information across various timeframes.

AI-powered insights are generated using Hugging Face’s inference API, facilitating the analysis of market trends through natural language processing with DuckDuckGo search integration for real-time sentiment analysis based on financial news πŸ¦†

Link to the GitHub project: https://github.com/louisbrulenaudet/agentic-market-tool

davanstrienΒ 
posted an update 12 days ago
fdaudensΒ 
posted an update 14 days ago
view post
Post
2117
πŸ”Š Meet Kokoro Web - Free, ML speech synthesis on your computer, that'll make you ditch paid services!

28 natural voices, unlimited generations, and WebGPU acceleration. Perfect for journalists and content creators.

Test it with full articlesβ€”sounds amazingly human! πŸŽ―πŸŽ™οΈ

Xenova/kokoro-web
davanstrienΒ 
posted an update 14 days ago
view post
Post
1876
How do you make 1M+ Hugging Face models & datasets more discoverable?

davanstrien/Smol-Hub-tldr!

I fine-tuned HuggingFaceTB/SmolLM2-360M to generate one-line summaries from a model or dataset README.

Its own self-description?
"A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

The goal? Make it easier to find the right models and datasets for your specific needs. It's already powering a semantic search for datasets Space.

It's still a WIP but thanks to @loubnabnl , @anton-l , @eliebak et al, for cooking such a nice base model for fine-tuning small, efficient models for specific domains and tasks. πŸ™
davanstrienΒ 
posted an update 15 days ago
fdaudensΒ 
posted an update 15 days ago
view post
Post
2681
⭐️ The AI Energy Score project just launched - this is a game-changer for making informed decisions about AI deployment.

You can now see exactly how much energy your chosen model will consume, with a simple 5-star rating system. Think appliance energy labels, but for AI.

Looking at transcription models on the leaderboard is fascinating: choosing between whisper-tiny or whisper-large-v3 can make a 7x difference. Real-time data on these tradeoffs changes everything.

166 models already evaluated across 10 different tasks, from text generation to image classification. The whole thing is public and you can submit your own models to test.

Why this matters:
- Teams can pick efficient models that still get the job done
- Developers can optimize for energy use from day one
- Organizations can finally predict their AI environmental impact

If you're building with AI at any scale, definitely worth checking out.

πŸ‘‰ leaderboard: https://lnkd.in/esrSxetj
πŸ‘‰ blog post: https://lnkd.in/eFJvzHi8

Huge work led by @sasha with @bgamazay @yjernite @sarahooker @regisss @meg
  • 1 reply
Β·