Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs; by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We built the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
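To make the second step concrete, here is a minimal, purely illustrative sketch of classifier-based filtering; the model id, label name, and threshold are placeholders and not the actual FineMath pipeline, which additionally retrains the classifier and recalls pages iteratively.

```python
# Illustrative sketch only: score pages for math reasoning content and keep the
# high-scoring ones. The model id, label, and threshold are hypothetical.
from transformers import pipeline

scorer = pipeline("text-classification", model="my-org/math-quality-classifier")  # hypothetical

def keep_page(page_text: str, threshold: float = 0.7) -> bool:
    # Score a slice of the page and keep it if the classifier is confident it is math.
    result = scorer(page_text[:2000])[0]
    return result["label"] == "math" and result["score"] >= threshold

pages = [
    "Proof: by induction on n, the sum of the first n odd numbers is n^2 ...",
    "Buy cheap shoes online with free shipping ...",
]
math_pages = [p for p in pages if keep_page(p)]
```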
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
My latest project is the outcome of the last 2+ years working with TPUs from the amazing TPU Research Cloud (TRC) program and training Encoder-only LMs with the TensorFlow Model Garden library.
- Cheatsheet for setting up a TPU VM Pod (with all necessary dependencies) to pretrain LMs with TF Model Garden
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models
- Supported architectures include BERT, BERT with Token Dropping, and TEAMS
I also released BERT-based models pretrained on the great Hugging Face FineWeb and FineWeb-Edu datasets (10BT subset). With more to come!
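Once converted, a checkpoint behaves like any other Transformers encoder. Below is a minimal sketch of loading one; the repo id is a placeholder, not the actual released model name.

```python
# Minimal sketch of using one of the converted encoder checkpoints with Transformers.
# The repo id below is hypothetical; substitute the real Hub model id.
from transformers import AutoModel, AutoTokenizer

model_id = "my-user/bert-base-fineweb-edu"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("FineWeb-Edu makes great pre-training data.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```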
Open Preference Dataset for Text-to-Image Generation by the Hugging Face Community
Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. It contains 10K text-to-image preference pairs across common image generation categories, covering different model families and varying prompt complexities.
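If you want to poke at the data, a short sketch of loading it with the datasets library follows; the repo id and column names are my assumptions, so check the dataset card for the exact schema.

```python
# Sketch: load the preference pairs with the `datasets` library.
# The dataset id and column names are assumptions; see the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("data-is-better-together/open-image-preferences-v1", split="train")  # assumed id
example = ds[0]
print(example.keys())          # inspect the actual columns
print(example.get("prompt"))   # e.g. the text prompt behind a preference pair
```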
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the code to reproduce it and our evaluations is public.
We will announce a big community project very soon, and we are working on a blog post walking you through the entire dataset creation process. Stay tuned!
Increasingly, LLMs are becoming very useful for scaling annotation tasks, i.e. labelling and filtering. When combined with structured generation, this can be a very scalable way of doing pre-annotation without requiring a large team of human annotators.
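As a concrete (hedged) example, here is a small sketch of LLM pre-annotation that asks the model for a JSON answer and validates it. The model id and label set are placeholders, and note that dedicated structured-generation libraries constrain the decoding itself, whereas this sketch only prompts for JSON and checks the result.

```python
# Sketch of LLM-assisted pre-annotation with a JSON output that is validated after generation.
# The model id and label set are placeholders.
import json
from transformers import pipeline

LABELS = ["keep", "discard"]  # hypothetical filtering labels

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

def pre_annotate(text: str) -> dict:
    prompt = (
        "You are a data annotator.\n"
        f"Label the text with one of {LABELS} and give a short reason.\n"
        'Answer with JSON only, e.g. {"label": "keep", "reason": "..."}.\n\n'
        f"Text: {text}\nJSON:"
    )
    raw = generator(prompt, max_new_tokens=96, return_full_text=False)[0]["generated_text"]
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        return {"label": None, "reason": "unparseable model output"}

print(pre_annotate("The integral of 1/x is ln|x| + C."))
```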
SmolVLM runs fast locally on a laptop thanks to mlx-vlm and @Gradio! Try it with two lines:
pip install git+https://github.com/andimarafioti/mlx-vlm.git@stream-generate-fix
python -m mlx_vlm.chat_ui --model mlx-community/SmolVLM-Instruct-8bit
Gotta love the MLX community! Big thanks to @pcuenq and @prince_canuma !
Zhipu AI, the Chinese generative AI startup behind CogVideo, just launched their first productized AI Agent - AutoGLM: https://agent.aminer.cn
With simple text or voice commands, it:
- Simulates phone operations effortlessly
- Autonomously handles 50+ step tasks
- Seamlessly operates across apps
Powered by Zhipu's "Decoupled Interface" and "Self-Evolving Learning Framework" to achieve major performance gains in Phone Use and Web Browser Use!
Meanwhile, GLM4-Edge is now on the Hugging Face Hub: THUDM/glm-edge-6743283c5809de4a7b9e0b8b
Packed with advanced dialogue + multimodal models:
- 1.5B / 2B models: built for mobile & in-car systems
- 4B / 5B models: optimized for PCs
Black Forest Labs Flux Dev vs. Stability AI Stable Diffusion 3.5 Large
Together with the data-is-better-together community, we've worked on an Apache 2.0 licensed open image preference dataset based on the fal.ai imgsys prompts dataset. Thanks to the awesome community, we managed to collect 5K preference pairs in less than 2 days. Inter-annotator agreement is great too.
Aashish Kumar won a month of Hugging Face Pro by making the most contributions! Congrats from the entire team!
The best thing? We are not done yet! Let's keep the annotations coming for 5K more in the second part of the sprint (with more prizes to go around)!
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL!
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook!
- SmolVLM can be fine-tuned on a Google Colab, or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not being trained on videos!
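If you'd rather run it with plain Transformers instead of MLX, a minimal sketch follows; it mirrors the pattern on the SmolVLM model card, but treat the exact ids and arguments as approximate.

```python
# Minimal sketch of running SmolVLM with Transformers; check the model card for exact usage.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```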
Would you like to get a high-quality dataset to pre-train LLMs in your language?
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check if your language is listed and sign up to be a Language Lead!
- 1M public posts from Bluesky's firehose API
- Includes text, metadata, and language predictions
- Perfect for experimenting with ML on Bluesky data
Excited to see people build more open tools for a more open social media platform!
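A quick sketch of pulling the dump with the datasets library is below; the repo id and column names are guesses on my part, so check the dataset card for the real schema.

```python
# Sketch: load the 1M-post dump and filter by predicted language.
# Dataset id and column names are assumptions; see the dataset card.
from datasets import load_dataset

posts = load_dataset("bluesky-community/one-million-bluesky-posts", split="train")  # assumed id
print(posts[0])

# e.g. keep only posts predicted to be in English
english = posts.filter(lambda row: row.get("language") == "en")
print(f"{len(english)} English posts")
```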
Let's make a generation of amazing image-generation models
The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source, so the community cannot train open models on them. Let's change that!
The community can contribute image preferences to an open-source dataset that could be used for building AI models that convert text to image, like the Flux or Stable Diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co./bluesky-community
My first, rather modest, contribution is a dashboard that shows the number of posts per second. Drinking straight from the firehose API!
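For the curious, here is a rough sketch of counting firehose events per second with the atproto Python SDK; it is simplified (it counts commit messages rather than fully decoding individual posts), and the exact API may vary between SDK versions.

```python
# Rough sketch: count Bluesky firehose commit events per second with the atproto SDK.
# Simplified: counts commit messages instead of decoding the individual post records.
import time
from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()
window_start = time.monotonic()
count = 0

def on_message(message) -> None:
    global window_start, count
    parse_subscribe_repos_message(message)  # decode the repo event envelope
    count += 1
    now = time.monotonic()
    if now - window_start >= 1.0:
        print(f"{count} events in the last second")
        window_start, count = now, 0

client.start(on_message)
```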