---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---
# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## Model Summary

This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on a single **NVIDIA Tesla P100 GPU (16 GB VRAM)**.

## Training Details

### **Training Configuration**

- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co./HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (16 GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** Yes

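
The hyperparameters above translate roughly into `trl`'s `GRPOConfig` as sketched below. This is a minimal reconstruction that assumes TRL's GRPO implementation was used; the `output_dir` value is a placeholder, and the actual run may have set additional arguments (for example, the number of generations per prompt) that are not listed in this card.

```python
from trl import GRPOConfig

# Hedged reconstruction of the training arguments listed above.
# Assumes TRL's GRPO implementation; "smollm2-135m-grpo-gsm8k" is a placeholder output path.
training_args = GRPOConfig(
    output_dir="smollm2-135m-grpo-gsm8k",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    logging_steps=1,
    fp16=True,
)
```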
### **Reward Functions Used**

The model was optimized using the following reward functions (illustrative sketches of two of them appear after the list):

1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**

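
The exact reward implementations are not reproduced in this card. As an illustration only, the integer and correctness rewards could look roughly like the sketch below. It assumes TRL-style reward functions that receive chat-formatted completions and the dataset's `answer` column as a keyword argument; the `_extract_answer_text` helper and the reward magnitudes (`0.5`, `2.0`) are placeholders, not necessarily the values used in training.

```python
import re

def _extract_answer_text(completion_text: str) -> str:
    # Illustrative helper: pull the text between <answer>...</answer> tags if present,
    # otherwise fall back to the whole completion.
    match = re.search(r"<answer>(.*?)</answer>", completion_text, re.DOTALL)
    return match.group(1).strip() if match else completion_text.strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small reward when the extracted answer looks like an integer (GSM8K answers are numeric).
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if _extract_answer_text(r).isdigit() else 0.0 for r in responses]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Larger reward when the extracted answer matches the ground-truth GSM8K answer.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if _extract_answer_text(r) == a else 0.0 for r, a in zip(responses, answer)]
```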
## Dataset Details

The model was trained on a subset of the **GSM8K** dataset, processed as follows:

- The **first 1500 samples** were selected to reduce training time.
- Each training sample consists of a **question (prompt)** and a **ground-truth answer** extracted from the raw answer field with:

```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K stores the final numeric answer after a "####" marker.
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```

- The dataset was loaded and formatted using:

```python
from datasets import Dataset, load_dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    # Load GSM8K, shuffle, and keep only the first `num_samples` examples.
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))
    # Wrap each question in a chat-style prompt; SYSTEM_PROMPT is defined elsewhere in the training script.
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
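
For context, these pieces would be wired together with TRL's `GRPOTrainer` roughly as shown below. This is a hedged sketch rather than the exact training script: it assumes the `training_args` object from the configuration sketch above and the five reward functions listed earlier are already defined in scope.

```python
from trl import GRPOTrainer

# Hypothetical wiring of the dataset, reward functions, and GRPO config shown above.
dataset = get_gsm8k_questions(split="train", num_samples=1500)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=training_args,  # GRPOConfig sketch from the Training Configuration section
    train_dataset=dataset,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
)
trainer.train()
```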
## Performance & Limitations

- The model was **fine-tuned on limited data** (1500 samples rather than the full GSM8K training set).
- Due to **hardware constraints (P100, 16 GB VRAM)**, the batch size, prompt/completion lengths, and sample count were kept small to fit the time and memory budget.
- The model is expected to handle **grade-school math reasoning** reasonably well, but **generalization is limited** by the small training set.

## How to Use

You can use this model with **Hugging Face Transformers** as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
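
Because the model was fine-tuned on chat-formatted prompts (a system message followed by the user's question), outputs will generally be closer to the training format if you apply the tokenizer's chat template instead of passing a raw string. A minimal sketch using the same placeholder model name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
# Format the conversation the way the instruct model expects, then generate.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```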
## Acknowledgements

- The **Hugging Face team** for **SmolLM2-135M-Instruct**
- **OpenAI** for the **GSM8K** dataset
- The authors of **GRPO** for the reward-based optimization technique

## Future Work

- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.