Inside the family of Smol models

Community Article · Published February 27, 2025

🔳 We explore the power of datasets and their integration in Hugging Face's small language models family, particularly in SmolLM2

Small language models make AI accessible to a wide range of users because they can run on edge devices like smartphones. However, there is one important issue – small models have less capacity and are highly sensitive to training data. That’s why the quality of datasets is even more crucial for them than for large language models (LLMs).

Today, we are going to explore the key role of datasets in training small models, using the SmolLM family from Hugging Face as an example, with a special focus on their SmolLM2 models. While Hugging Face researchers have already published extensive materials on all SmolLM and SmolVLM models, we believe it’s important to compile everything in one place to highlight Hugging Face’s contributions to high-quality datasets and their training strategies for small AI models.

So, let’s dive into Hugging Face’s approach to achieving state-of-the-art performance in small LMs through mixing the best datasets and step-by-step training.


📨 Click follow! If you want to receive our articles straight to your inbox, please subscribe here


In today’s episode, we will cover:

  • The secret behind training small models
  • How it all began: From SmolLM to SmolVLM2
  • SmolLM2: The power of training data
  • Multi-stage training strategy
  • Instruction tuning and preference learning
  • SmolLM2 performance, advantages, and limitations
  • Conclusion and resources to dive deeper

The secret behind training small models

So, what’s the trick to making small models punch above their weight? There are a few key strategies that make all the difference:

  1. Distillation – Training a smaller model to mimic a larger one while retaining its knowledge and skills.
  2. Quantization – Reducing the precision of numerical data (e.g., using fewer bits per weight) to make the model smaller and more efficient (see the short sketch after this list).
  3. Training from scratch – Some small models are designed from the ground up, leveraging carefully curated datasets and optimized architectures to maximize performance with fewer parameters.
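As a quick aside, the second option is the easiest to try yourself. Below is a minimal, hedged sketch of loading a small model with 8-bit weights via the `transformers` + `bitsandbytes` integration; the checkpoint id and settings are illustrative choices, not how the Smol models themselves were built.

```python
# Minimal sketch of quantization at load time: each weight is stored in 8 bits
# instead of 16/32, shrinking memory use so the model fits on modest hardware.
# Requires: pip install transformers accelerate bitsandbytes (and a GPU for bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # illustrative checkpoint choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8 bits per weight
    device_map="auto",
)

inputs = tokenizer("Small models can run", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```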

Today we are going to discuss the third option – how to improve a model’s reasoning through high-quality training data.

As smaller models have less capacity, the data used to train them is even more important than for large models. Instead of memorizing random facts, small models must be carefully optimized to learn essential knowledge and reasoning skills. Small models are also more sensitive to noise in the data, so choosing, filtering, and balancing different sources carefully is crucial. In particular, it is important to find the right mix of specialized datasets (like math and code) so that the model can learn effectively.

The details of how small models are trained are usually not shared by the developers. Here is where Hugging Face researchers, with their strong commitment to true open source, have contributed to the AI and ML community by creating their family of small models – including SmolLM, SmolLM2, SmolVLM, and SmolVLM2 – and providing their training guides.

How it all began: From SmolLM to SmolVLM2

Are all SmolLM models unified by a single common concept?

Since the very first SmolLM release on July 16, 2024, Hugging Face has followed a strategy built on these principles:

  • Building small models that are as powerful as possible and can run on local devices, from smartphones to laptops.
  • Training them on high-quality datasets.
  • Developing fully open models and datasets.

SmolLM, their first set of small AI models, is available in three sizes:

  • 135M parameters (very lightweight)
  • 360M parameters (balanced)
  • 1.7B parameters (still small but more powerful)

It marked the first step toward high-performance, on-device small LMs. The path to state-of-the-art performance for SmolLM lay in training the models on a high-quality dataset called SmolLM-Corpus, which includes:

  • Cosmopedia v2 – 28 billion tokens of AI-generated textbooks, stories, and code covering over 34,000 topics. It mixes 40% middle-school content (for better common sense and science knowledge), 30% college-level content, and 30% other sources.
  • FineWeb-Edu – 220 billion tokens from a collection of educational web pages selected from the broader FineWeb dataset.
  • Python-Edu – 4 billion tokens of high-quality Python code examples.

For training the 135M and 360M versions, Hugging Face used 600 billion tokens from SmolLM-Corpus, while the 1.7B version was trained on 1 trillion tokens. Their architectures differ as well.

The 135M and 360M models use Grouped-Query Attention (GQA) to speed up inference: multiple query heads share the same key-value pairs instead of having separate ones, which reduces the computational cost. They also prioritize depth over width (more layers, fewer neurons per layer) to improve performance while staying small. In contrast, the 1.7B model’s architecture is similar to traditional GPT-style transformers.
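Here is a minimal PyTorch sketch of the GQA idea, where groups of query heads reuse one key/value head; the head counts and shapes are illustrative, not SmolLM’s actual configuration.

```python
# Minimal sketch of Grouped-Query Attention (GQA): several query heads attend
# using the same shared key/value head, cutting the KV cache size.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_query_heads, n_kv_heads = 8, 2            # 4 query heads per KV head (illustrative)
group_size = n_query_heads // n_kv_heads

q = torch.randn(batch, n_query_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head so it is shared by its whole group of query heads
k = k.repeat_interleave(group_size, dim=1)   # -> (batch, n_query_heads, seq, dim)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn = F.softmax(scores, dim=-1) @ v         # (batch, n_query_heads, seq, dim)
print(attn.shape)
```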

Here are the results SmolLM models achieved in common-sense reasoning, world knowledge, and coding skills:

Image Credit: SmolLM Hugging Face blog post

Back on November 2, 2024, Hugging Face unveiled SmolLM2 in three sizes: 135M, 360M, and 1.7B parameters.

This updated version improved upon its predecessor through enhanced data curation, such as additional custom math and code datasets, and refined training techniques. To make it strong, researchers trained it on 11 trillion tokens of data in several stages.

SmolLM2 gained more popularity after Hugging Face published the SmolLM2 paper on February 4, 2025. We’ll discuss SmolLM2 further in our article.

Later, on November 26, 2024, Hugging Face expanded their Smol models into multimodal territory. The small but powerful multimodal SmolVLM can answer questions about images, describe pictures, create stories based on multiple images, and work as a regular text model without images, but it cannot generate images.

Image Credit: SmolVLM blog post

SmolVLM is based on Idefics3 and leverages a smaller language model backbone, SmolLM2 1.7B, with a context window that was extended to 16,000 tokens by increasing the RoPE (rotary positional embedding) base value.
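As a rough illustration of why a larger RoPE base supports longer contexts, the sketch below compares rotary “wavelengths” for two base values; the numbers are illustrative, not the exact values used in SmolVLM.

```python
# Rough illustration: RoPE rotates each pair of dimensions at frequency
# base**(-2i/d). A larger base gives lower frequencies (longer wavelengths),
# so positions repeat less over long sequences and the context can stretch.
# Base values below are illustrative only.
import numpy as np

def rope_wavelengths(base, dim=64):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # per-pair rotation speed
    return 2 * np.pi / inv_freq                        # positions per full rotation

for base in (10_000, 100_000):  # small vs. enlarged base (illustrative)
    print(base, "longest wavelength ~", int(rope_wavelengths(base).max()))
```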

SmolVLM has the following features:

  • Improved image compression: It uses a “pixel shuffle” strategy that compresses image data more aggressively (a 9x reduction compared to 4x in previous designs). Pixel shuffle rearranges data from the spatial dimensions (height and width) into the depth (channel) dimension, so the same information is carried by fewer visual tokens. This speeds up processing and reduces memory use (see the sketch after this list).
  • Visual token encoding: SmolVLM converts images into 81 visual tokens per 384×384 patch for efficient processing of large images.
  • Customizable resolution: Users can adjust image quality to balance performance and memory use.
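Below is a minimal sketch of the space-to-depth idea behind pixel shuffle; the tensor shapes are illustrative (chosen so a 3x3 shuffle turns a 27x27 grid of patch features into 81 visual tokens), not SmolVLM’s exact implementation.

```python
# Minimal sketch of the pixel-shuffle / space-to-depth idea: fold r x r
# neighborhoods of image patches into the channel dimension so the grid of
# visual tokens shrinks by r**2 (here r=3 gives the 9x reduction mentioned above).
import torch

def space_to_depth(x, r=3):
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // r, r, w // r, r)
    x = x.permute(0, 3, 5, 1, 2, 4)                 # move each r x r block into channels
    return x.reshape(b, c * r * r, h // r, w // r)

features = torch.randn(1, 1152, 27, 27)             # e.g. a 27x27 grid of patch features (illustrative)
compressed = space_to_depth(features, r=3)
print(features.shape[-2] * features.shape[-1], "->", compressed.shape[-2] * compressed.shape[-1])
# 729 tokens -> 81 tokens, a 9x reduction
```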

For SmolVLM, the developers also paid special attention to the training data: the model was trained on The Cauldron and Docmatix datasets, with a focus on document understanding (25%) and image captioning (18%), plus skills like visual reasoning and chart analysis.

Image Credit: SmolVLM model card

Initially, SmolVLM was released in only one size – 2B parameters – and later (on January 23, 2025) two more sizes, 256M and 500M, were released.

Image Credit: Merve’s Twitter

Although designed for images, SmolVLM’s long-context capability allows it to handle video analysis as well. By sampling multiple frames from a video (up to 50 frames in a simple pipeline), SmolVLM also shows competitive performance on video benchmarks.

However, true video reasoning only recently emerged on February 20, 2025, with SmolVLM2, which can “watch” videos and understand them – just like how you watch TV on your phone or laptop. It can analyze what’s happening in a video clip.

Hugging Face released three versions of SmolVLM2:

  • 256M model: An experimental, even tinier model that pushes the limits of how small a video model can be.
  • 500M model: A smaller version that almost matches the 2.2B’s video skills.
  • 2.2B model: The main model for complex video tasks – it is great at reading text in images, solving math problems shown in pictures, and answering science questions.

SmolVLM2 outperforms the previous SmolVLM family on image math problems and scientific visual questions, and it is better at understanding complex diagrams and text in photos.

Image Credit: SmolVLM2 blog post

It also consumes less memory than other video models.

Image Credit: SmolVLM2 blog post

SmolVLM2 models were trained on the data mixture used for Apollo, the video LLM family from Meta GenAI and Stanford, including “a moderate amount of text data and maintaining a slight video-heavy mix.”

Among all these small language models, SmolLM2 offers a practical guide and strategies for building competitive models of smaller size. So, it’s time to dive in!

SmolLM2: The power of training data

Most language models are trained on text from the web, but high-quality AI models also include specialized data focused on math, programming, and following instructions to improve their ability to reason and solve complex problems.

With this in mind, the researchers behind SmolLM2:

  • Evaluated different types of training data (web text, code, math, and instructions) to balance the mix for the best performance.
  • Created new high-quality datasets (FineMath for math, Stack-Edu for coding, and SmolTalk for instructions) because existing ones were too small or had low quality.
  • Used a careful multi-stage training process, adjusting the data mix step by step to make sure SmolLM2 learned effectively.

Let’s explore everything in order.

Image Credit: Loubna Ben Allal's Twitter

New datasets and their optimal combination

1. English data

Researchers first trained SmolLM2 on general web data and then fine-tuned it using math and coding datasets. They focused on two popular high-quality web datasets:

  • FineWeb-Edu: A dataset focused on educational content. It performed better on education-based tests such as MMLU, ARC, and OpenBookQA.
  • DCLM: A dataset with more diverse, conversational text, including insights from online discussions. It performed better on reasoning and common-sense tests (HellaSwag, CommonSenseQA) and added conversational diversity.

Since the two datasets have complementary strengths, mixing them in different proportions is a natural strategy. Researchers found that a 60% FineWeb-Edu + 40% DCLM mix gave the best overall results. In total, this approach yielded 5.1 trillion tokens of high-quality English text for training SmolLM2.
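Here is a minimal sketch of how such a 60/40 mix could be assembled with the `datasets` library; the repository names and streaming setup are illustrative assumptions, not the paper’s exact pipeline.

```python
# Minimal sketch of mixing two web datasets at a 60/40 ratio with
# datasets.interleave_datasets. Repo ids below are assumptions for
# illustration; the paper's exact preprocessing is not shown.
from datasets import load_dataset, interleave_datasets

fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

mixed = interleave_datasets(
    [fineweb_edu, dclm],
    probabilities=[0.6, 0.4],   # 60% FineWeb-Edu, 40% DCLM
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:100])
```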

2. Math data: Creating a new FineMath dataset

Since neither of the existing math datasets, OpenWebMath (OWM) and InfiMM-WebMath, was perfect, researchers built a new and improved dataset called FineMath with 54 billion tokens, focusing on:

  • Step-by-step math problem-solving, not just formulas.
  • Middle school to high school-level math, making it more balanced.
  • Filtering for quality, using AI to rate and select the best math content.

FineMath performed much better than the older datasets, doubling accuracy on basic math problems and improving advanced math performance by 6 times. This shows that better, more structured math data leads to stronger AI reasoning skills.

3. Code data: Stack-Edu

Training models on programming code also improves their overall reasoning skills. To upgrade SmolLM2’s coding skills, Hugging Face created Stack-Edu, a filtered version of StarCoder2Data. Its features include:

  • Educational code samples
  • Well-documented programs
  • 15 major programming languages

Here, researchers also used AI to evaluate the educational quality of code and kept only the best parts, resulting in a highly refined dataset of 125 billion tokens. While Stack-Edu is a relatively small dataset, it proved effective, improving coding abilities across most programming languages. Java performed best with a slightly lower filtering threshold, demonstrating that different languages may need different training strategies.
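The filtering step can be pictured as scoring each sample with an educational-quality classifier and keeping only samples above a threshold, roughly as in the sketch below; the classifier checkpoint and threshold are stand-ins, not the exact per-language classifiers behind Stack-Edu.

```python
# Sketch of classifier-based filtering: score each code sample for educational
# quality and keep only high-scoring ones. The checkpoint and threshold are
# illustrative stand-ins (the open FineWeb-Edu classifier used as an example).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "HuggingFaceFW/fineweb-edu-classifier"   # an open edu-quality scorer, used here as a stand-in
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

samples = [
    "def add(a, b):\n    # Add two integers with a short explanation\n    return a + b",
    "x=1;y=2;print(x+y)",
]

kept = []
for code in samples:
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = model(**inputs).logits.squeeze(-1).item()   # roughly a 0-5 quality score
    if score >= 3.0:        # threshold is an assumption; the paper tuned it per language
        kept.append(code)

print(f"kept {len(kept)} of {len(samples)} samples")
```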

Image Credit: The original SmolLM2 paper

How should all these datasets be combined to make the most of SmolLM2’s training? Earlier we mentioned that smart data mixing is key; another important point is that longer training helps small models perform better while keeping costs low. That’s why Hugging Face came up with a multi-stage training strategy.

Multi-stage training strategy

SmolLM2-1.7B was trained on 11 trillion tokens using a multi-stage approach, where the data mix was adjusted as training progressed to get the best results. Researchers followed four key principles:

  1. Adapting data based on performance: They monitored the model’s progress and adjusted the dataset mix to fix weak spots.
  2. Saving the best math (FineMath) and code (Stack-Edu) data for the final stages to maximize impact.
  3. Gradual introduction of medium-sized datasets: Careful blending in math and coding data instead of adding everything at once.
  4. Avoiding too much repetition: about 4-5 repetitions of a dataset is the practical limit.

Here is the four-stage training process of SmolLM2:

Image Credit: The original SmolLM2 paper

Stage 1: General knowledge (0 to 6T tokens)

  • This step is focused on English web data (90% of training data), including 60% educational content from FineWeb-Edu and 40% general Q&A and reasoning data from DCLM.
  • Coding data was only 10% of training at this stage. Math data was not included yet.

The model performed as expected in general knowledge, but math and coding skills were weak.

Stage 2: Adding more coding & math (6 to 8T tokens)

  • Introduced math data (5%) using OpenWebMath (OWM). Increased coding data to 20%.
  • Kept English web data at 75% with the same FineWeb-Edu & DCLM mix.

At this stage, SmolLM2’s coding skills improved, but math performance was still weak.

Stage 3: More math & coding refinement (8 to 10T tokens)

  • Increased math data to 10% by adding InfiMM-WebMath.
  • Replaced StarCoderData with Stack-Edu for better code learning.
  • Added Jupyter Notebooks to help the model understand code explanations better.
  • Adjusted English web mix to 40% FineWeb-Edu & 60% DCLM.

Performance improved, but training hit a “loss spike” – a sudden, temporary increase in the loss value (and a corresponding dip in accuracy). By the end of the stage, the model had recovered and continued improving.

Stage 4: Final fine-tuning & wrapping up (10 to 11T tokens)

  • Researchers gradually lowered the learning rate (this process is called "decay").
  • They added the highest-quality math data: FineMath4+ & InfiWebMath-3+. Math data now made up 14% of training.
  • Expanded Stack-Edu to cover more programming languages.
  • Kept 58% English web data with a higher DCLM ratio.

This stage resulted in the biggest improvement in math and coding performance, along with an upgrade in general knowledge.
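To keep the four stages straight, here is the schedule condensed into a small Python structure; the percentages come from the stage descriptions above, and unstated shares are left as `None` (the exact mixture is in the SmolLM2 paper).

```python
# A compact summary of the four-stage data schedule described above.
# Percentages are the approximate shares quoted in the text; None marks
# values not stated here (see the original paper for the full mixture).
STAGES = [
    {"tokens": "0-6T",   "web": 0.90, "code": 0.10, "math": 0.00,
     "web_mix": "60% FineWeb-Edu / 40% DCLM"},
    {"tokens": "6-8T",   "web": 0.75, "code": 0.20, "math": 0.05,
     "web_mix": "60% FineWeb-Edu / 40% DCLM"},
    {"tokens": "8-10T",  "web": None, "code": None, "math": 0.10,    # Stack-Edu replaces StarCoderData
     "web_mix": "40% FineWeb-Edu / 60% DCLM"},
    {"tokens": "10-11T", "web": 0.58, "code": None, "math": 0.14,    # learning-rate decay phase
     "web_mix": "40% FineWeb-Edu / 60% DCLM"},
]

for i, stage in enumerate(STAGES, start=1):
    print(f"Stage {i} ({stage['tokens']} tokens):", stage)
```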

Image Credit: The original SmolLM2 paper

This multi-stage training was used for the largest SmolLM2 version with 1.7B parameters. Since the SmolLM2 360M and 135M versions have less capacity and require less computing power, their training process was simplified. Instead of multi-stage training, they used a single-stage approach with consistently high-quality data. The training data mixture was refined by:

  • Filtering DCLM with FineWeb-Edu’s classifier to remove low-quality samples.
  • Including Stack-Edu for coding, InfiMM-WebMath and FineMath for math reasoning, and Cosmopedia for structured knowledge from the start.

In total, SmolLM2-360M was trained on 4 trillion tokens and SmolLM2-135M on 2 trillion tokens.

Making SmolLM2 handle more text

The lengthy four-stage training is not enough to unlock the full potential of SmolLM2. To deal with the full range of complex tasks, it needs to process more context effectively. To improve this, researchers increased SmolLM2’s context length from 2K to 8K tokens. They used a special data mix with long documents, including: the books subset of Dolma (20%), educational and general web text from DCLM (10%) and FineWeb-Edu (10%), and the dataset mixture from Stage 4 (60%).
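Under the hood, this kind of context extension for a Llama-style model typically pairs the long-document mix with config-level changes like the ones sketched below; all values here are illustrative assumptions, not SmolLM2’s exact hyperparameters.

```python
# Sketch of the config-level changes that typically accompany long-context
# training for a Llama-style model such as SmolLM2: allow more positions and
# enlarge the RoPE base so rotary embeddings rotate more slowly.
# All values are illustrative assumptions.
from transformers import LlamaConfig

short_ctx = LlamaConfig(max_position_embeddings=2048, rope_theta=10_000.0)

long_ctx = LlamaConfig(
    max_position_embeddings=8192,   # 2K -> 8K context
    rope_theta=100_000.0,           # assumption: a larger RoPE base for longer contexts
)

print(short_ctx.max_position_embeddings, "->", long_ctx.max_position_embeddings)
```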

Combining all these training aspects resulted in SmolLM2-1.7B achieving state-of-the-art performance. But it’s not the last stage of training.

Image Credit: The original SmolLM2 paper

Instruction tuning and preference learning

Post-training is another important step for improving a model’s reasoning, response accuracy, and instruction following. Here is how it was done for all SmolLM2 models.

Instruction tuning with SmolTalk dataset

For Supervised Fine-Tuning (SFT) of SmolLM2-1.7B, researchers created the SmolTalk dataset to improve SmolLM2’s ability to understand and follow instructions. It was built around a filtered MagPie-Ultra dataset plus several new task-specific sets, and included:

  • Conversational data (from MagPie-Ultra): Teaches SmolLM2 to handle natural conversations.
  • Task-specific data: It helps SmolLM2 improve in areas like following complex instructions (Smol-Constraint), summarization (Smol-Summarization), and rewriting text (Smol-Rewrite).
  • Math data: Improves SmolLM2’s math reasoning skills.
  • Code data & long-context data (from LongAlign): Helps with coding tasks and makes SmolLM2 better at handling long text inputs.
  • Persona-based data from datasets like PersonaHub, emails, tweets, etc., to teach SmolLM2 how to write in different tones and styles.

SmolTalk outperformed other instruction-tuning datasets in multiple benchmarks and helped SmolLM2 give clearer, more structured, and useful responses.
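As a rough sketch of what this SFT step could look like with the `trl` library; the dataset and model identifiers, the subset name, and the hyperparameters are illustrative assumptions, not the exact recipe from the paper.

```python
# Minimal sketch of supervised fine-tuning on SmolTalk-style chat data with
# TRL's SFTTrainer. Dataset/model ids, the "all" subset, and hyperparameters
# are illustrative assumptions, not the SmolLM2 paper's exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-1.7B",     # base checkpoint; SFTTrainer loads it and its tokenizer
    train_dataset=dataset,                  # conversational "messages" format
    args=SFTConfig(output_dir="smollm2-sft", per_device_train_batch_size=2),
)
trainer.train()
```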

For SFT of SmolLM2-360M and SmolLM2-135M, researchers used a simplified version of SmolTalk, removing complex tasks like function calling and focusing on easier-to-handle instructions that match the smaller models' capabilities.

Preference learning

This final stage of training teaches SmolLM2-1.7B to prefer better-quality responses using Direct Preference Optimization (DPO). Researchers used the UltraFeedback dataset, which worked best for improving reasoning and knowledge, and trained SmolLM2 over two rounds, adjusting learning rates. After all the tuning and preference learning, we get the final "Instruct" version of SmolLM2-1.7B.
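A minimal sketch of such a DPO step with the `trl` library; the dataset variant, the checkpoint path, and the hyperparameters are illustrative assumptions rather than the paper’s exact setup.

```python
# Minimal sketch of preference learning with TRL's DPOTrainer on an
# UltraFeedback-style dataset of chosen/rejected response pairs.
# Dataset/model ids and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_path = "smollm2-sft"   # hypothetical local path to the SFT checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# A binarized UltraFeedback variant with "chosen"/"rejected" pairs (illustrative choice)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

trainer = DPOTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=DPOConfig(output_dir="smollm2-dpo", beta=0.1, per_device_train_batch_size=2),
)
trainer.train()
```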

For the 360M and 135M versions, preference learning also used UltraFeedback to improve instruction following while keeping responses clear and helpful.

Thanks to these strategies and carefully mixed datasets, SmolLM2 turned into an all-rounder AI model that performs well across multiple domains. Let’s see what results it achieved.

SmolLM2 performance

Compared to similar models like Qwen2.5-1.5B-Instruct and Llama3.2-1B-Instruct, SmolLM2-1.7B has the following results:

  • It’s better at instruction-following than Qwen2.5-1.5B.
  • Shows strong performance in conversation, reasoning, and rewriting tasks.
  • SmolLM2 remains competitive in math (GSM8K, MATH) and knowledge-based tests.

Image Credit: The original SmolLM2 paper

Well, there’s already plenty of information here, so it’s time to summarize what makes SmolLM2 an outstanding small model.

Advantages of SmolLM2

  • Smaller size, but strong performance: As we have seen, SmolLM2-1.7B outperforms or remains competitive with other small models, such as Qwen2.5-1.5B and Llama3.2-1B, in math, coding, reasoning, and instruction following.
  • Lower computational cost: The SmolLM2 set of models is designed to run efficiently on devices with limited resources while maintaining high accuracy in reasoning, coding, and instruction-following tasks.
  • Carefully curated datasets: Unlike models that rely primarily on raw web data, SmolLM2 models were trained using a balanced mix of high-quality sources.
  • Efficient multi-stage training strategy without overfitting.
  • Extended context length from 2K to 8K tokens: This makes SmolLM2 better suited for tasks like summarization and deep reading. Additionally, there is no major performance loss after extending context length.
  • The model, datasets, and training code are open-source, allowing researchers and developers to build on top of SmolLM2.
  • SmolLM2-360M and SmolLM2-135M versions allow for even more lightweight AI applications, running on low-resource devices while still maintaining strong performance.

Not without limitations

Despite being a strong performer for its size, SmolLM2-1.7B still has some key limitations:

  • SmolLM2 is not the top performer across the board: it still lags behind some other small models on math and coding tasks, and while its instruction following is strong, it can still improve in some areas.
  • It may sometimes struggle with complex reasoning just because it is a small model that cannot store as much knowledge as larger models.
  • It doesn’t perform as well as other models like Qwen2.5-1.5B in retrieving specific information from long inputs.
  • Expensive training: SmolLM2 training still requires significant computational resources, making it expensive to train from scratch ($250K in compute).
  • While SmolLM2 learned to prefer better answers, it relies heavily on synthetic feedback, so its understanding of nuanced human preferences is still evolving.

Conclusion

Small language models make AI accessible to everyone and have their own unique ways of becoming powerful and competitive. In this episode, we discussed the entire family of Hugging Face’s “Smol” models – including SmolLM, SmolLM2, SmolVLM, and SmolVLM2 – and explored SmolLM2’s journey to strong reasoning through a carefully curated mixture of high-quality datasets and a smart multi-step training strategy.

With this comprehensive training guide from Hugging Face, your path to building powerful small models is now open.

Author: Alyona Vert · Editor: Ksenia Se

Bonus: Resources to dive deeper

  1. SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model by Hugging Face
  2. SmolLM – blazingly fast and remarkably powerful by Loubna Ben Allal, Anton Lozhkov, Elie Bakouch
  3. SmolVLM – small yet mighty Vision Language Model by Andres Marafioti, Merve Noyan, Miquel Farré, Elie Bakouch, Pedro Cuenca
  4. SmolVLM2: Bringing Video Understanding to Every Device by Orr Zohar, Miquel Farré, Andres Marafioti, Merve Noyan, Pedro Cuenca, Cyril, Joshua
  5. SmolLM series of models
  6. SmolLM2 series of models
  7. SmolVLM series of models
  8. SmolVLM2 series of models
  9. Smol Models on Hugging Face’s GitHub
  10. Small Language Models: Survey, Measurements, and Insights

Sources from Turing Post

  1. Token 1.10: Large vs Small in AI: The Language Model Size Dilemma

That’s all for today. Thank you for reading!


Please share this article with your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

