# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)
## Model Summary
This is a SmolLM2-135M model fine-tuned with Group Relative Policy Optimization (GRPO) on a subset of the GSM8K dataset (only the first 1500 samples, due to time and memory constraints). Training was run on an NVIDIA Tesla P100 GPU (16 GB VRAM).
## Training Details
### Training Configuration
- Base Model: `HuggingFaceTB/SmolLM2-135M-Instruct`
- Fine-Tuning Technique: GRPO (Group Relative Policy Optimization)
- Dataset: GSM8K (first 1500 samples)
- GPU Used: NVIDIA Tesla P100 (16 GB VRAM)
- Precision: `float16`
- Optimizer: `adamw_torch_fused`
- Batch Size: 8
- Gradient Accumulation Steps: 2
- Max Prompt Length: 128
- Max Completion Length: 100
- Epochs: 1
- Learning Rate: 5e-6
- LR Scheduler: `cosine`
- Weight Decay: 0.2
- Logging Steps: 1
- FP16 Enabled: ✅
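For reference, the sketch below shows how these hyperparameters would map onto a `GRPOConfig`/`GRPOTrainer` setup from the `trl` library. This is a hedged reconstruction rather than the original training script: the use of `trl`, the output directory, and the dataset/reward-function variable names are assumptions.

```python
from trl import GRPOConfig, GRPOTrainer

# Hedged reconstruction of the training setup from the values listed above.
training_args = GRPOConfig(
    output_dir="SmolLM2-135M-GRPO",   # assumed output directory
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    logging_steps=1,
    fp16=True,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    reward_funcs=[                    # reward functions listed in the next section
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,            # output of get_gsm8k_questions(), see Dataset Details
)
trainer.train()
```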
### Reward Functions Used
The model was optimized using the following reward functions:
- `xmlcount_reward_func`
- `soft_format_reward_func`
- `strict_format_reward_func`
- `int_reward_func`
- `correctness_reward_func`
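The implementations of these functions are not reproduced in this card. In TRL-style GRPO training, each reward function receives the prompts, the generated completions, and any extra dataset columns (such as `answer`) as keyword arguments, and returns one float per completion. Below is a minimal, hypothetical sketch of what `correctness_reward_func` could look like, assuming chat-formatted completions and an `<answer>...</answer>` output convention; the reward values and extraction logic are illustrative assumptions.

```python
def extract_xml_answer(text: str) -> str:
    # Hypothetical helper: pull the text between <answer> ... </answer> tags.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Completions arrive as chat-style message lists; score 2.0 when the
    # extracted answer matches the ground truth, 0.0 otherwise.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
```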
## Dataset Details
The model was trained on a subset of the GSM8K dataset. The dataset was processed as follows:
- The first 1500 samples were selected to reduce training time.
- Each training sample consisted of a question (prompt) and a ground-truth answer extracted using:

  ```python
  def extract_hash_answer(text: str) -> str | None:
      if "####" not in text:
          return None
      return text.split("####")[1].strip()
  ```
- The dataset was loaded and formatted using:
  ```python
  from datasets import Dataset, load_dataset

  def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
      data = load_dataset('openai/gsm8k', 'main')[split]
      data = data.shuffle(seed=42).select(range(num_samples))  # Selecting 1500 samples
      data = data.map(lambda x: {
          'prompt': [
              {'role': 'system', 'content': SYSTEM_PROMPT},  # SYSTEM_PROMPT: see note below
              {'role': 'user', 'content': x['question']}
          ],
          'answer': extract_hash_answer(x['answer'])
      })
      return data
  ```
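Note that `SYSTEM_PROMPT` is referenced above but not reproduced in this card. Given the XML-count and format reward functions, it presumably instructs the model to answer in a tagged reasoning/answer format; the prompt below is a hypothetical example for illustration only, not the exact one used in training.

```python
# Hypothetical system prompt; the wording used in training is not published here.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

# Build the training split used for fine-tuning
dataset = get_gsm8k_questions(split="train", num_samples=1500)
```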
## Performance & Limitations
- The model was fine-tuned on limited data (1500 samples instead of the full dataset).
- Due to hardware constraints (P100, 16 GB VRAM), efficiency-oriented settings were used: `float16` precision, a small batch size with gradient accumulation, and short maximum prompt/completion lengths.
- The model is expected to perform well on mathematical reasoning tasks but may have limited generalization due to the small training set.
## How to Use
You can use this model with Hugging Face Transformers as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Macromrit/SmolLM2-135M-GRPO-Trained-For-Reasoning"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
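Because the model was fine-tuned on chat-formatted prompts (a system message plus a user question), applying the tokenizer's chat template may produce better-formatted outputs. A variant of the example above (the exact system prompt used in training is not published, so it is omitted here):

```python
messages = [
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```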
## Acknowledgements
- Hugging Face team for SmolLM2-135M
- OpenAI for the GSM8K dataset
- The GRPO technique for reward-based optimization
## Future Work
- Increase dataset size for better generalization.
- Optimize training on larger GPUs (e.g., A100, H100).
- Experiment with different reward functions to improve accuracy.