We are reproducing the full DeepSeek R1 data and training pipeline so that everyone can use the recipe. Instead of doing it in secret, we can do it together in the open!
Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.
Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
Step 3: show we can go from base model -> SFT -> RL via multi-stage training.
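As a rough illustration of what the Step 1 distillation stage looks like in practice, here is a minimal SFT sketch with TRL. The dataset id and model choice below are placeholders, not the project's actual configuration:

```python
# Minimal sketch of Step 1: supervised fine-tuning on a reasoning corpus
# distilled from DeepSeek-R1. "open-r1/example-reasoning-corpus" is a
# hypothetical dataset id used only for illustration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-r1/example-reasoning-corpus", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small model works for a dry run
    args=SFTConfig(output_dir="r1-distill-sft"),
    train_dataset=dataset,
)
trainer.train()
```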
Digital Odyssey: AI Image & Video Generation Platform
Welcome to our all-in-one AI platform for image and video generation!

Key Features
- High-quality image generation from text
- Video creation from still images
- Multi-language support with automatic translation
- Advanced customization options

Unique Advantages
- Fast and accurate results using the FLUX.1-dev and Hyper-SD models
- Robust content safety filtering system
- Intuitive user interface
- Extended toolkit including image upscaling and logo generation

How to Use
1. Enter your image or video description
2. Adjust settings as needed
3. Click generate
4. Save and share your results automatically
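For reference, text-to-image generation with FLUX.1-dev boils down to a few lines with diffusers. This is a generic sketch of that model's usage, not the platform's actual serving code (which the post does not show):

```python
# Text-to-image with FLUX.1-dev via diffusers; requires a GPU with enough VRAM
# and access to the gated black-forest-labs/FLUX.1-dev checkpoint.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    "a lighthouse on a cliff at sunset, watercolor style",
    num_inference_steps=28,  # illustrative step count
    guidance_scale=3.5,
).images[0]
image.save("result.png")
```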
Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We built the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages, using a classifier trained on synthetic annotations to identify math reasoning and deduction.
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
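A quick way to peek at the data with the datasets library; the "finemath-4plus" config name is taken from the dataset card and worth verifying on HuggingFaceTB/finemath before relying on it:

```python
# Stream a sample from FineMath without downloading the full 50B+ token dump.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True
)
print(next(iter(ds))["text"][:500])  # peek at the first document
```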
After some heated discussion, we want to clarify our intent regarding storage limits on the Hub.
TL;DR:
- Public storage is free and, barring blatant abuse, unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible.
- Private storage is paid above a significant free tier (1 TB if you have a paid account, 100 GB otherwise).
We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community.
Multimodal
- At Hugging Face we released SmolVLM, a performant and efficient smol vision language model
- Show Lab released ShowUI-2B, a new vision-language-action model for building GUI/web automation agents
- Rhymes AI released the base models of Aria, Aria-Base-64K and Aria-Base-8K, with their respective context lengths
- The ViDoRe team released ColSmolVLM, a new ColPali-like retrieval model based on SmolVLM
- Dataset: Llava-CoT-o1-Instruct, a new dataset labelled using the Llava-CoT multimodal reasoning model
- Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, released by its creators

LLMs
- The Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet
- Alibaba released Marco-o1, a new open-source reasoning model
- NVIDIA released Hymba 1.5B Base and Instruct, new state-of-the-art SLMs with a hybrid architecture (Mamba + transformer)

Image/Video Generation
- Qwen2VL-Flux: a new image generation model based on the Qwen2VL image encoder, T5, and Flux for generation
- Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 resolution
- Dataset: Image Preferences, a new image generation preference dataset made through Argilla's DIBT community effort

Audio
- OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B, trained on 5B audio prompt tokens
INTELLECT-1 is the first collaboratively trained 10-billion-parameter language model, trained from scratch on 1 trillion tokens of English text and code.
HuggingFace Trending TOP 300 Board - Featuring an AI Rating System

Service Introduction
A comprehensive dashboard that provides at-a-glance access to the real-time TOP 300 trending Spaces, Models, and Datasets on HuggingFace. Our specially developed AI rating system evaluates the practical value and growth potential of each item.

Key Features
1. AI Rising Rate
- Growth potential evaluation based on creation date and ranking
- 5-tier star rating system (★ to ★★★★★)

Evaluation Criteria:
- Recency: higher relative weight for recently created items
- Ranking impact: higher relative weight for top rankings
- Comprehensive assessment using statistical/analytical models applied to AI
2. AI Popularity Score
- Comprehensive evaluation combining objective popularity and the Rising Rate
- 18-tier grading system from AAA+ to B-

Evaluation Elements:
- Base Score: benchmark based on likes, downloads, comments, etc.
- Additional Score: Rising Rate applied as a weighting factor
- Comprehensive assessment using statistical/analytical models applied to AI (a toy version of this scoring is sketched below)
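To make the two-stage rating concrete, here is a toy version built only from the description above; the actual weights, time windows, and tier cutoffs are not public, so every constant here is a made-up assumption:

```python
# Toy illustration of the board's rating logic. All weights are invented;
# they only show how recency, rank, and popularity could be combined.
from datetime import datetime, timezone

def rising_rate(created_at: datetime, rank: int, top_n: int = 300) -> float:
    """Score growth potential (0.0-1.0) from recency and current ranking."""
    age_days = (datetime.now(timezone.utc) - created_at).days
    recency = max(0.0, 1.0 - age_days / 90)    # newer items weigh more
    rank_impact = 1.0 - (rank - 1) / top_n     # top ranks weigh more
    return 0.5 * recency + 0.5 * rank_impact

def popularity_score(likes: int, downloads: int, rising: float) -> float:
    """Combine a base popularity score with the rising rate as a weighted bonus."""
    base = likes * 2 + downloads * 0.01        # arbitrary illustrative weights
    return base * (1.0 + 0.5 * rising)         # rising rate as a multiplier
```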
Use Cases
- AI/ML project trend analysis
- Early discovery of promising models/datasets
- Community activity monitoring
- Research/development direction reference
Key Advantages
- Real-time TOP 300 ranking
- AI-based objective evaluation system
- Fast loading with a caching system
- Intuitive and modern UI/UX
- Integrated dashboard for all 3 categories

Update Cycle
- Real-time data reflection
- Manual refresh option
- Minimized server load through screenshot caching
- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
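For a quick look at the SmolTalk SFT dataset mentioned above, the snippet below assumes the "all" config name from the dataset card; check HuggingFaceTB/smoltalk if it errors:

```python
# Load SmolTalk and inspect one chat-formatted training example.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"])  # list of {role, content} turns
```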
Apache 2.0 licensed. V2 pre-training data mix coming soon!
Which other tools should we add next?