GRPO Training Overview

SmolLM2-135M Fine-Tuned with GRPO on GSM8K (1,500-Sample Subset)

📌 Model Summary

This is a SmolLM2-135M model fine-tuned with Group Relative Policy Optimization (GRPO) on a subset of the GSM8K dataset (a 1,500-sample subset, chosen due to time and memory constraints). Training was conducted on an NVIDIA Tesla P100 GPU (16 GB VRAM).

📊 Training Details

🛠 Training Configuration

  • Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
  • Fine-Tuning Technique: GRPO (Group Relative Policy Optimization)
  • Dataset: GSM8K (1,500-sample subset)
  • GPU Used: NVIDIA Tesla P100 (16 GB VRAM)
  • Precision: float16
  • Optimizer: adamw_torch_fused
  • Batch Size: 8
  • Gradient Accumulation Steps: 2
  • Max Prompt Length: 128
  • Max Completion Length: 100
  • Epochs: 1
  • Learning Rate: 5e-6
  • LR Scheduler: cosine
  • Weight Decay: 0.2
  • Logging Steps: 1
  • FP16 Enabled: ✅
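
For reference, these settings map onto the TRL GRPOTrainer roughly as shown below. This is a minimal sketch rather than the actual training script: the output_dir is a placeholder, the batch size is assumed to be per device, and the reward functions and get_gsm8k_questions are the ones described in the following sections.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="SmolLM2-135M-GRPO",           # placeholder output path
    learning_rate=5e-6,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,            # "Batch Size: 8", assumed to be per device
    gradient_accumulation_steps=2,
    max_prompt_length=128,
    max_completion_length=100,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    logging_steps=1,
    fp16=True,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=get_gsm8k_questions(),      # defined under "Dataset Details" below
)
trainer.train()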

πŸ† Reward Functions Used

The model was optimized using the following reward functions (an illustrative sketch of two of them follows the list):

  1. xmlcount_reward_func
  2. soft_format_reward_func
  3. strict_format_reward_func
  4. int_reward_func
  5. correctness_reward_func
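
The exact reward implementations are not reproduced in this card. The sketch below shows what int_reward_func and correctness_reward_func typically look like in GRPO-on-GSM8K recipes; the extract_xml_answer helper, the <answer> tag convention, and the reward magnitudes (0.5 and 2.0) are assumptions, not confirmed details of this training run.

def extract_xml_answer(text: str) -> str:
    # Assumed helper: pull the model's final answer out of <answer>...</answer> tags
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small bonus when the extracted answer is a bare integer (GSM8K answers are integers).
    # Each completion is a list of chat messages; take the assistant message text.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Full reward only when the extracted answer exactly matches the ground-truth answer
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]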

πŸ“ Dataset Details

The model was trained on a subset of the GSM8K dataset. The dataset was processed as follows:

  • A 1,500-sample subset was selected (shuffled with seed 42, as shown below) to reduce training time.
  • Each training sample consisted of a question (prompt) and a ground-truth answer extracted using:
    def extract_hash_answer(text: str) -> str | None:
        # GSM8K solutions end with the final answer after a "####" marker,
        # e.g. "... #### 72" -> "72"
        if "####" not in text:
            return None
        return text.split("####")[1].strip()
    
  • The dataset was loaded and formatted using:
    from datasets import load_dataset, Dataset

    def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
        data = load_dataset('openai/gsm8k', 'main')[split]
        # Shuffle with a fixed seed, then keep a 1500-sample subset
        data = data.shuffle(seed=42).select(range(num_samples))
        # Wrap each question in a chat-style prompt; SYSTEM_PROMPT is the system
        # prompt used during training (defined elsewhere in the training script)
        data = data.map(lambda x: {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': x['question']}
            ],
            'answer': extract_hash_answer(x['answer'])
        })
        return data
    
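After this mapping, each record looks roughly like the following (placeholder values; SYSTEM_PROMPT stands for the system prompt used during training):

{
    'prompt': [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': '<GSM8K question text>'}
    ],
    'answer': '<final answer extracted after "####">'
}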

⚡ Performance & Limitations

  • The model was fine-tuned on limited data (1500 samples instead of the full dataset).
  • Due to hardware constraints (a single P100 with 16 GB VRAM), memory-saving settings were used: fp16 precision, a per-device batch size of 8 with gradient accumulation, and short prompt/completion length limits.
  • The model targets GSM8K-style mathematical reasoning, but it may generalize poorly beyond that domain because of the small training set.

🔧 How to Use

You can use this model with Hugging Face Transformers as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Macromrit/SmolLM2-135M-GRPO-Trained-For-Reasoning"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
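
Because the base model is an instruct model and GRPO training used chat-formatted prompts (system + user messages), generation usually works better through the tokenizer's chat template. A minimal sketch, continuing from the snippet above (the system prompt used during training is not reproduced here, so only a user message is sent):

messages = [
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"}
]
# Render the conversation with the model's chat template, ending with the assistant turn marker
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))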

🚀 Acknowledgements

  • Hugging Face Team for SmolLM2-135M
  • OpenAI for the GSM8K dataset
  • GRPO fine-tuning technique for reward-based optimization

📌 Future Work

  • Increase dataset size for better generalization.
  • Optimize training on larger GPUs (e.g., A100, H100).
  • Experiment with different reward functions to improve accuracy.