Update README.md
README.md CHANGED

---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## Model Summary
This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on a single **NVIDIA Tesla P100 GPU (16 GB VRAM)**.

## Training Details

### **Training Configuration**
- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (16 GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** yes
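
The original training script is not included in this card. Assuming the run used the `trl` library's GRPO implementation (an assumption, not stated above), the settings listed would map roughly onto a `GRPOConfig` like this sketch:

```python
# Hypothetical sketch: maps the hyperparameters listed above onto trl's GRPOConfig.
# Assumes trl's GRPOTrainer was used; the actual training script is not reproduced here.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="SmolLM2-135M-GRPO",      # placeholder output path
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    fp16=True,
    logging_steps=1,
)
```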

### **Reward Functions Used**
The model was optimized using the following reward functions:
1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**
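
The implementations of these functions are not reproduced here. In `trl`-style GRPO training each reward function scores every sampled completion, so a minimal sketch of two of them might look like the following (the `<answer>` tag format and the reward values are assumptions inferred from the reward-function names):

```python
# Hypothetical sketch of two of the reward functions above; the actual implementations may differ.
# Each function returns one score per sampled completion (trl GRPOTrainer convention).
def extract_xml_answer(text: str) -> str:
    # Assumes the system prompt asks for the final answer inside <answer>...</answer> tags.
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[-1].split("</answer>")[0].strip()
    return text.strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small reward when the extracted answer is a bare integer.
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if extract_xml_answer(r).isdigit() else 0.0 for r in responses]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Larger reward when the extracted answer matches the ground-truth answer.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_xml_answer(r) == a else 0.0 for r, a in zip(responses, answer)]
```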

## Dataset Details
The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows:
- The **first 1500 samples** were selected to reduce training time.
- Each training sample consisted of a **question (prompt)** and a **ground truth answer** extracted using:
```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K answers end with "#### <final answer>",
    # e.g. extract_hash_answer("She sold 48 + 24 = 72 clips. #### 72") -> "72"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```
- The dataset was loaded and formatted using:
```python
from datasets import Dataset, load_dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))  # select 1500 samples after shuffling
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},  # SYSTEM_PROMPT is defined in the training script
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
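
Again assuming `trl`'s `GRPOTrainer` (the original training script is not included in this card), the configuration, reward functions, and dataset above would be wired together roughly like this:

```python
# Hypothetical sketch of the training wiring, assuming trl's GRPOTrainer was used.
from trl import GRPOTrainer

dataset = get_gsm8k_questions(split="train", num_samples=1500)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=training_args,              # the GRPOConfig sketched earlier
    train_dataset=dataset,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
)
trainer.train()
```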

## Performance & Limitations
- The model was **fine-tuned on limited data** (1500 samples instead of the full dataset).
- Due to **hardware constraints (a single P100, 16 GB VRAM)**, the batch size, sequence lengths, and `float16` precision listed above were chosen to keep training within memory and time budgets.
- The model is expected to perform well on **grade-school mathematical reasoning** but may have **limited generalization** due to the small training set.

## How to Use
You can use this model with **Hugging Face Transformers** as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)  # matches the 100-token completion budget used in training

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
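
Because the training prompts were chat-formatted (a system message plus the question), inference may work better through the tokenizer's chat template. Continuing from the snippet above, a sketch follows; the system prompt here is a placeholder, since the exact `SYSTEM_PROMPT` used during training is not reproduced in this card:

```python
# Hypothetical: chat-template inference with a placeholder system prompt.
messages = [
    {"role": "system", "content": "Reason step by step, then give the final answer."},  # placeholder, not the training SYSTEM_PROMPT
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```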

## Acknowledgements
- **Hugging Face Team** for **SmolLM2-135M**
- **OpenAI** for the **GSM8K dataset**
- The **GRPO fine-tuning technique** for reward-based optimization

## Future Work
- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.