Macromrit committed on
Commit 4682436 · verified · 1 Parent(s): 9afa2ee

Update README.md

Files changed (1): README.md (+103 -3)
README.md CHANGED
@@ -1,3 +1,103 @@
---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## 📌 Model Summary
This is a **SmolLM2-135M** model fine-tuned with **Group Relative Policy Optimization (GRPO)** on a subset of the **GSM8K** dataset (only the first 1500 samples, due to time and memory constraints). Training was run on an **NVIDIA Tesla P100 GPU with 21GB VRAM**.

## 📊 Training Details

### **🛠 Training Configuration**
- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Group Relative Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (21GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** ✅
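
These hyperparameters map naturally onto `trl`'s `GRPOConfig`. The snippet below is a minimal sketch of how such a run could be wired up with `trl`'s `GRPOTrainer`; the actual training script is not included in this repo, so treat `output_dir`, `dataset`, and the wiring as assumptions reconstructed from this card:

```python
# Hypothetical reconstruction of the training setup from the numbers above;
# the actual script may differ.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="smollm2-135m-grpo-gsm8k",  # assumed name
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    weight_decay=0.2,
    optim="adamw_torch_fused",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    max_prompt_length=128,
    max_completion_length=100,
    logging_steps=1,
    fp16=True,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    # The five reward functions listed in the next section.
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,  # output of get_gsm8k_questions(), shown below
)
trainer.train()
```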

### **🏆 Reward Functions Used**
The model was optimized using the following reward functions:
1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**
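
The definitions of these functions are not reproduced in this card. For the expected shape: a `trl`-compatible GRPO reward function receives the batch of completions (plus any dataset columns as keyword arguments) and returns one float per completion. A minimal sketch of what `int_reward_func` plausibly does, assuming chat-style completions:

```python
# Illustrative sketch only; the reward functions actually used in training
# are defined in the (unpublished) training script. With conversational
# prompts, each completion is a list holding one assistant message dict.
def int_reward_func(completions, **kwargs) -> list[float]:
    """Give a small bonus when the completion ends in a bare integer."""
    scores = []
    for completion in completions:
        text = completion[0]["content"].strip()
        tokens = text.split()
        last = tokens[-1] if tokens else ""
        scores.append(0.5 if last.lstrip("-").isdigit() else 0.0)
    return scores
```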

## 📝 Dataset Details
The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows:
- The **first 1500 samples** were selected to reduce training time.
- Each training sample consisted of a **question (prompt)** and a **ground-truth answer** extracted using:
```python
def extract_hash_answer(text: str) -> str | None:
    # GSM8K reference answers end with "#### <final answer>".
    if "####" not in text:
        return None
    return text.split("####")[1].strip()
```
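
For example, on a GSM8K-style answer string:

```python
# GSM8K answers end with a "#### <number>" marker.
example = "She sold 48 clips in April and 24 in May, 72 in total. #### 72"
print(extract_hash_answer(example))           # -> "72"
print(extract_hash_answer("no marker here"))  # -> None
```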
- The dataset was loaded and formatted using:
```python
from datasets import Dataset, load_dataset

def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.shuffle(seed=42).select(range(num_samples))  # selecting 1500 samples
    data = data.map(lambda x: {
        # SYSTEM_PROMPT is the system prompt defined in the training script.
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data
```
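
Calling it yields a `datasets.Dataset` whose rows carry a chat-formatted `prompt` and the extracted `answer`, e.g.:

```python
dataset = get_gsm8k_questions()
print(dataset[0]["prompt"][1]["role"])  # -> "user"
print(dataset[0]["answer"])             # an extracted string such as "72"
```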

## ⚡ Performance & Limitations
- The model was **fine-tuned on limited data** (1500 samples instead of the full dataset).
- Due to **hardware constraints (P100, 21GB VRAM)**, training was kept efficient via **`float16` precision, short prompt/completion lengths, and gradient accumulation**.
- The model is expected to perform well on **mathematical reasoning tasks** but may show **limited generalization** because of the small training set.

## 🔧 How to Use
You can use this model with **Hugging Face Transformers** as follows. Since the model was fine-tuned on chat-formatted prompts, apply the tokenizer's chat template before generating:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat-formatted prompt (ideally prepend the same system
# prompt used during training, defined in the training script).
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Generate up to 100 new tokens, matching the training completion length
output = model.generate(inputs, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## 🚀 Acknowledgements
- **Hugging Face Team** for **SmolLM2-135M**
- **OpenAI** for the **GSM8K dataset**
- The **GRPO fine-tuning technique** for reward-based optimization

## 📌 Future Work
- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.