---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---
# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## Model Summary

This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on a single **NVIDIA Tesla P100 GPU (16 GB VRAM)**.

## Training Details

### **Training Configuration**

- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co./HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (16 GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** Yes

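
The hyperparameters above translate roughly into `trl`'s `GRPOConfig` as sketched below. This is a minimal reconstruction that assumes TRL's GRPO implementation was used; the `output_dir` value is a placeholder, and the actual run may have set additional arguments (for example, the number of generations per prompt) that are not listed in this card.

```python
from trl import GRPOConfig

# Hedged reconstruction of the training arguments listed above.
# Assumes TRL's GRPO implementation; "smollm2-135m-grpo-gsm8k" is a placeholder output path.
training_args = GRPOConfig(
    output_dir="smollm2-135m-grpo-gsm8k",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    logging_steps=1,
    fp16=True,
)
```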
### **Reward Functions Used**

The model was optimized using the following reward functions (illustrative sketches of two of them appear after the list):

1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**

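
The exact reward implementations are not reproduced in this card. As an illustration only, the integer and correctness rewards could look roughly like the sketch below. It assumes TRL-style reward functions that receive chat-formatted completions and the dataset's `answer` column as a keyword argument; the `_extract_answer_text` helper and the reward magnitudes (`0.5`, `2.0`) are placeholders, not necessarily the values used in training.

```python
import re

def _extract_answer_text(completion_text: str) -> str:
    # Illustrative helper: pull the text between <answer>...</answer> tags if present,
    # otherwise fall back to the whole completion.
    match = re.search(r"<answer>(.*?)</answer>", completion_text, re.DOTALL)
    return match.group(1).strip() if match else completion_text.strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small reward when the extracted answer looks like an integer (GSM8K answers are numeric).
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if _extract_answer_text(r).isdigit() else 0.0 for r in responses]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Larger reward when the extracted answer matches the ground-truth GSM8K answer.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if _extract_answer_text(r) == a else 0.0 for r, a in zip(responses, answer)]
```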
## Dataset Details

The model was trained on a subset of the **GSM8K** dataset, processed as follows:

- The **first 1500 samples** were selected to reduce training time.
- Each training sample consists of a **question (prompt)** and a **ground-truth answer** extracted from the raw answer field with:

```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K stores the final numeric answer after a "####" marker.
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```

- The dataset was loaded and formatted using:

```python
from datasets import Dataset, load_dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    # Load GSM8K, shuffle, and keep only the first `num_samples` examples.
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))
    # Wrap each question in a chat-style prompt; SYSTEM_PROMPT is defined elsewhere in the training script.
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
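
For context, these pieces would be wired together with TRL's `GRPOTrainer` roughly as shown below. This is a hedged sketch rather than the exact training script: it assumes the `training_args` object from the configuration sketch above and the five reward functions listed earlier are already defined in scope.

```python
from trl import GRPOTrainer

# Hypothetical wiring of the dataset, reward functions, and GRPO config shown above.
dataset = get_gsm8k_questions(split="train", num_samples=1500)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=training_args,  # GRPOConfig sketch from the Training Configuration section
    train_dataset=dataset,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
)
trainer.train()
```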
## Performance & Limitations

- The model was **fine-tuned on limited data** (1500 samples rather than the full GSM8K training set).
- Due to **hardware constraints (P100, 16 GB VRAM)**, the batch size, prompt/completion lengths, and sample count were kept small to fit the time and memory budget.
- The model is expected to handle **grade-school math reasoning** reasonably well, but **generalization is limited** by the small training set.

## How to Use

You can use this model with **Hugging Face Transformers** as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
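
Because the model was fine-tuned on chat-formatted prompts (a system message followed by the user's question), outputs will generally be closer to the training format if you apply the tokenizer's chat template instead of passing a raw string. A minimal sketch using the same placeholder model name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
# Format the conversation the way the instruct model expects, then generate.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```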
## Acknowledgements

- The **Hugging Face team** for **SmolLM2-135M-Instruct**
- **OpenAI** for the **GSM8K** dataset
- The authors of **GRPO** for the reward-based optimization technique

## Future Work

- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.