---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---
# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)
## πŸ“Œ Model Summary
This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on an **NVIDIA Tesla P100 GPU with 21GB VRAM**.
## πŸ“Š Training Details
### **πŸ›  Training Configuration**
- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co./HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (21GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** βœ…
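The hyperparameters above map fairly directly onto TRL's `GRPOConfig`. The following is a minimal sketch of that mapping; `output_dir` and `num_generations` are illustrative assumptions, not values reported on this card:
```python
from trl import GRPOConfig

# Sketch only: restates the reported hyperparameters as a TRL GRPOConfig.
# `output_dir` and `num_generations` are assumptions, not reported values.
training_args = GRPOConfig(
    output_dir="smollm2-135m-grpo-gsm8k",  # assumed output path
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    logging_steps=1,
    fp16=True,
    num_generations=4,  # GRPO group size per prompt (assumed)
)
```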
### **πŸ† Reward Functions Used**
The model was optimized using the following reward functions:
1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**
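The card does not include the function bodies. As a rough guide, the two answer-oriented rewards typically look like the sketch below under TRL's reward-function interface; the `extract_xml_answer` helper and the reward magnitudes are assumptions drawn from the commonly shared GRPO-on-GSM8K recipe, not this model's exact code:
```python
import re

def extract_xml_answer(text: str) -> str:
    # Assumed helper: pull the text between <answer> ... </answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Full reward when the extracted answer matches the GSM8K ground truth.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]

def int_reward_func(prompts, completions, **kwargs) -> list[float]:
    # Small reward for producing a bare integer inside the answer tags.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]
```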
## πŸ“ Dataset Details
The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows:
- The **first 1500 samples** were selected to reduce training time.
- Each training sample consisted of a **question (prompt)** and a **ground truth answer** extracted using:
```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K solutions put the final answer after a "####" marker,
    # e.g. "... Altogether she sold 48 + 24 = 72 clips. #### 72" -> "72"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```
- The dataset was loaded and formatted using:
```python
from datasets import load_dataset, Dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))  # shuffle with a fixed seed, keep 1500 samples
    data = data.map(lambda x: {
        # SYSTEM_PROMPT is the training system prompt, defined elsewhere in the training script
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
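Putting these pieces together, the run can be wired up roughly as follows. This is a sketch rather than the exact training script: it assumes the `GRPOConfig` shown earlier is in scope as `training_args` and that all five reward functions are defined (only two are sketched above).
```python
from trl import GRPOTrainer

# Sketch: assumes `training_args` (GRPOConfig) and the five reward
# functions listed above are already defined in the script.
dataset = get_gsm8k_questions(split="train", num_samples=1500)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```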
## ⚑ Performance & Limitations
- The model was **fine-tuned on limited data** (1500 samples rather than GSM8K's full ~7.5K-example training split).
- Due to **hardware constraints (P100, 21GB VRAM)**, training was kept deliberately lightweight: `float16` precision, short prompt/completion lengths, and a single epoch.
- The model is expected to perform well on **mathematical reasoning tasks** but may have **limited generalization** due to the small training set.
## πŸ”§ How to Use
You can use this model with **Hugging Face Transformers** as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "your-username/SmolLM2-135M-GRPO"  # replace with this repo's id
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
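Because the base model is instruction-tuned and the GRPO prompts were chat-formatted (system + user messages), inference usually behaves better through the chat template. A sketch, reusing the `model` and `tokenizer` loaded above (the system prompt here is an illustrative placeholder, not the exact `SYSTEM_PROMPT` used in training):
```python
messages = [
    # Placeholder system prompt; the original SYSTEM_PROMPT is not shown on this card.
    {"role": "system", "content": "Reason step by step, then give the final answer."},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```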
## πŸš€ Acknowledgements
- The **Hugging Face Team** for **SmolLM2-135M-Instruct**
- **OpenAI** for the **GSM8K** dataset
- The **GRPO** algorithm for reward-based fine-tuning
## πŸ“Œ Future Work
- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.