GRPO Training Overview

SmolLM2-135M Fine-Tuned with GRPO on GSM8K (1,500-Sample Subset)

📌 Model Summary

This is a SmolLM2-135M model fine-tuned with Group Relative Policy Optimization (GRPO) on a subset of the GSM8K dataset (a 1,500-sample subset, chosen due to time and memory constraints). Training was conducted on an NVIDIA Tesla P100 GPU (16 GB VRAM).

📊 Training Details

🛠 Training Configuration

  • Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
  • Fine-Tuning Technique: GRPO (Group Relative Policy Optimization)
  • Dataset: GSM8K (1,500-sample subset)
  • GPU Used: NVIDIA Tesla P100 (16 GB VRAM)
  • Precision: float16
  • Optimizer: adamw_torch_fused
  • Batch Size: 8
  • Gradient Accumulation Steps: 2
  • Max Prompt Length: 128
  • Max Completion Length: 100
  • Epochs: 1
  • Learning Rate: 5e-6
  • LR Scheduler: cosine
  • Weight Decay: 0.2
  • Logging Steps: 1
  • FP16 Enabled: ✅
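
For reference, these settings map onto the TRL GRPOTrainer roughly as shown below. This is a minimal sketch rather than the actual training script: the output_dir is a placeholder, the batch size is assumed to be per device, and the reward functions and get_gsm8k_questions are the ones described in the following sections.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="SmolLM2-135M-GRPO",           # placeholder output path
    learning_rate=5e-6,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,            # "Batch Size: 8", assumed to be per device
    gradient_accumulation_steps=2,
    max_prompt_length=128,
    max_completion_length=100,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    logging_steps=1,
    fp16=True,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=get_gsm8k_questions(),      # defined under "Dataset Details" below
)
trainer.train()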

πŸ† Reward Functions Used

The model was optimized using the following reward functions (an illustrative sketch of two of them follows the list):

  1. xmlcount_reward_func
  2. soft_format_reward_func
  3. strict_format_reward_func
  4. int_reward_func
  5. correctness_reward_func
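
The exact reward implementations are not reproduced in this card. The sketch below shows what int_reward_func and correctness_reward_func typically look like in GRPO-on-GSM8K recipes; the extract_xml_answer helper, the <answer> tag convention, and the reward magnitudes (0.5 and 2.0) are assumptions, not confirmed details of this training run.

def extract_xml_answer(text: str) -> str:
    # Assumed helper: pull the model's final answer out of <answer>...</answer> tags
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small bonus when the extracted answer is a bare integer (GSM8K answers are integers).
    # Each completion is a list of chat messages; take the assistant message text.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Full reward only when the extracted answer exactly matches the ground-truth answer
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]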

πŸ“ Dataset Details

The model was trained on a subset of the GSM8K dataset. The dataset was processed as follows:

  • A 1,500-sample subset was selected (shuffled with seed 42, as shown below) to reduce training time.
  • Each training sample consisted of a question (prompt) and a ground-truth answer extracted using:
    def extract_hash_answer(text: str) -> str | None:
        # GSM8K solutions end with the final answer after a "####" marker,
        # e.g. "... #### 72" -> "72"
        if "####" not in text:
            return None
        return text.split("####")[1].strip()
    
  • The dataset was loaded and formatted using:
    from datasets import load_dataset, Dataset

    def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
        data = load_dataset('openai/gsm8k', 'main')[split]
        # Shuffle with a fixed seed, then keep a 1500-sample subset
        data = data.shuffle(seed=42).select(range(num_samples))
        # Wrap each question in a chat-style prompt; SYSTEM_PROMPT is the system
        # prompt used during training (defined elsewhere in the training script)
        data = data.map(lambda x: {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': x['question']}
            ],
            'answer': extract_hash_answer(x['answer'])
        })
        return data
    
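After this mapping, each record looks roughly like the following (placeholder values; SYSTEM_PROMPT stands for the system prompt used during training):

{
    'prompt': [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': '<GSM8K question text>'}
    ],
    'answer': '<final answer extracted after "####">'
}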

⚡ Performance & Limitations

  • The model was fine-tuned on limited data (1500 samples instead of the full dataset).
  • Due to hardware constraints (a single P100 with 16 GB VRAM), memory-saving settings were used: fp16 precision, a per-device batch size of 8 with gradient accumulation, and short prompt/completion length limits.
  • The model targets GSM8K-style mathematical reasoning, but it may generalize poorly beyond that domain because of the small training set.

🔧 How to Use

You can use this model with Hugging Face Transformers as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Macromrit/SmolLM2-135M-GRPO-Trained-For-Reasoning"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
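
Because the base model is an instruct model and GRPO training used chat-formatted prompts (system + user messages), generation usually works better through the tokenizer's chat template. A minimal sketch, continuing from the snippet above (the system prompt used during training is not reproduced here, so only a user message is sent):

messages = [
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"}
]
# Render the conversation with the model's chat template, ending with the assistant turn marker
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))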

🚀 Acknowledgements

  • Hugging Face Team for SmolLM2-135M
  • OpenAI for the GSM8K dataset
  • GRPO fine-tuning technique for reward-based optimization

📌 Future Work

  • Increase dataset size for better generalization.
  • Optimize training on larger GPUs (e.g., A100, H100).
  • Experiment with different reward functions to improve accuracy.