Update README.md
README.md CHANGED

---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## Model Summary
This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on a single **NVIDIA Tesla P100 GPU (16 GB VRAM)**.

## Training Details

### **Training Configuration**
- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (16 GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** yes
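
The original training script is not included in this card. Assuming the run used the `trl` library's GRPO implementation (an assumption, not stated above), the settings listed would map roughly onto a `GRPOConfig` like this sketch:

```python
# Hypothetical sketch: maps the hyperparameters listed above onto trl's GRPOConfig.
# Assumes trl's GRPOTrainer was used; the actual training script is not reproduced here.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="SmolLM2-135M-GRPO",      # placeholder output path
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    fp16=True,
    logging_steps=1,
)
```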

### **Reward Functions Used**
The model was optimized using the following reward functions:
1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**
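
The implementations of these functions are not reproduced here. In `trl`-style GRPO training each reward function scores every sampled completion, so a minimal sketch of two of them might look like the following (the `<answer>` tag format and the reward values are assumptions inferred from the reward-function names):

```python
# Hypothetical sketch of two of the reward functions above; the actual implementations may differ.
# Each function returns one score per sampled completion (trl GRPOTrainer convention).
def extract_xml_answer(text: str) -> str:
    # Assumes the system prompt asks for the final answer inside <answer>...</answer> tags.
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[-1].split("</answer>")[0].strip()
    return text.strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small reward when the extracted answer is a bare integer.
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if extract_xml_answer(r).isdigit() else 0.0 for r in responses]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Larger reward when the extracted answer matches the ground-truth answer.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_xml_answer(r) == a else 0.0 for r, a in zip(responses, answer)]
```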

## Dataset Details
The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows:
- The **first 1500 samples** were selected to reduce training time.
- Each training sample consisted of a **question (prompt)** and a **ground truth answer** extracted using:
```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K answers end with "#### <final answer>",
    # e.g. extract_hash_answer("She sold 48 + 24 = 72 clips. #### 72") -> "72"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```
- The dataset was loaded and formatted using:
```python
from datasets import Dataset, load_dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))  # select 1500 samples after shuffling
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},  # SYSTEM_PROMPT is defined in the training script
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
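
Again assuming `trl`'s `GRPOTrainer` (the original training script is not included in this card), the configuration, reward functions, and dataset above would be wired together roughly like this:

```python
# Hypothetical sketch of the training wiring, assuming trl's GRPOTrainer was used.
from trl import GRPOTrainer

dataset = get_gsm8k_questions(split="train", num_samples=1500)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=training_args,              # the GRPOConfig sketched earlier
    train_dataset=dataset,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
)
trainer.train()
```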

## Performance & Limitations
- The model was **fine-tuned on limited data** (1500 samples instead of the full dataset).
- Due to **hardware constraints (a single P100, 16 GB VRAM)**, the batch size, sequence lengths, and `float16` precision listed above were chosen to keep training within memory and time budgets.
- The model is expected to perform well on **grade-school mathematical reasoning** but may have **limited generalization** due to the small training set.

## How to Use
You can use this model with **Hugging Face Transformers** as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)  # matches the 100-token completion budget used in training

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
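
Because the training prompts were chat-formatted (a system message plus the question), inference may work better through the tokenizer's chat template. Continuing from the snippet above, a sketch follows; the system prompt here is a placeholder, since the exact `SYSTEM_PROMPT` used during training is not reproduced in this card:

```python
# Hypothetical: chat-template inference with a placeholder system prompt.
messages = [
    {"role": "system", "content": "Reason step by step, then give the final answer."},  # placeholder, not the training SYSTEM_PROMPT
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```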

## Acknowledgements
- **Hugging Face Team** for **SmolLM2-135M**
- **OpenAI** for the **GSM8K dataset**
- The **GRPO fine-tuning technique** for reward-based optimization

## Future Work
- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.