Update README.md
README.md
CHANGED
@@ -248,6 +248,58 @@ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
* **Output:** Generated English-language text in response to the input, such as an answer to a question, or a summary of a document.

#### Training Hyperparameters

The following hyperparameters were used during training (a sketch of the corresponding `TrainingArguments` follows the list):

- **learning_rate:** `3e-4`
- **train_batch_size:** effectively 4 (`per_device_train_batch_size=1` with `gradient_accumulation_steps=4`)
- **eval_batch_size:** not explicitly set; determined by the evaluation defaults
- **seed:** not explicitly stated, so exact reproducibility is not guaranteed
- **optimizer:** `paged_adamw_8bit`, a paged 8-bit AdamW variant that keeps optimizer-state memory low
- **lr_scheduler_type:** not stated; the logged learning rates decay linearly to zero over the 500 steps, consistent with a linear schedule
- **training_steps:** `500`
- **mixed_precision_training:** not explicitly mentioned
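
The training script itself is not included in this card, so the following is a best-guess sketch only: the listed values mapped onto Hugging Face `TrainingArguments`. Here `output_dir`, `logging_steps`, and the linear scheduler are assumptions inferred from the logs below, not stated in the card.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run from the hyperparameter list above;
# anything not stated in the card is marked as an assumption.
args = TrainingArguments(
    output_dir="outputs",            # assumption: not stated in the card
    learning_rate=3e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective train batch size of 4
    max_steps=500,
    optim="paged_adamw_8bit",        # paged 8-bit AdamW (requires bitsandbytes)
    lr_scheduler_type="linear",      # assumption: inferred from the logged LR decay
    logging_steps=25,                # matches the 25-step cadence of the results table
)
```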

#### Training Results

Below is a summary of the training results at every 25th step (plus the first step): training loss, gradient norm, learning rate, and the corresponding epoch.

| Training Step | Training Loss | Grad Norm | Learning Rate | Epoch |
|---------------|---------------|-----------|---------------|-------|
| 1   | 2.1426 | 1.333079 | 2.976e-04 | 0.04 |
| 25  | 1.1061 | 0.756779 | 2.856e-04 | 0.22 |
| 50  | 0.8865 | 0.601220 | 2.705e-04 | 0.44 |
| 75  | 0.9921 | 0.634705 | 2.555e-04 | 0.67 |
| 100 | 0.8814 | 0.594633 | 2.405e-04 | 0.89 |
| 125 | 0.5098 | 0.787081 | 2.255e-04 | 1.11 |
| 150 | 0.4647 | 0.577686 | 2.104e-04 | 1.33 |
| 175 | 0.4096 | 0.687792 | 1.954e-04 | 1.55 |
| 200 | 0.5006 | 0.669076 | 1.804e-04 | 1.77 |
| 225 | 0.5101 | 0.676769 | 1.653e-04 | 2.00 |
| 250 | 0.1939 | 0.656288 | 1.503e-04 | 2.22 |
| 275 | 0.2506 | 0.620012 | 1.353e-04 | 2.44 |
| 300 | 0.2050 | 0.642024 | 1.202e-04 | 2.66 |
| 325 | 0.3296 | 0.553642 | 1.052e-04 | 2.88 |
| 350 | 0.0799 | 0.331929 | 9.018e-05 | 3.10 |
| 375 | 0.0951 | 0.682525 | 7.515e-05 | 3.33 |
| 400 | 0.0927 | 0.438669 | 6.012e-05 | 3.55 |
| 425 | 0.0845 | 0.422025 | 4.509e-05 | 3.77 |
| 450 | 0.2115 | 0.718012 | 3.006e-05 | 3.99 |
| 475 | 0.0538 | 0.167244 | 1.503e-05 | 4.21 |
| 500 | 0.0438 | 0.184941 | 0.0       | 4.43 |
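
With the Hugging Face `Trainer`, a table like this can be produced from `trainer.state.log_history`, which accumulates one dict per logging step. The sketch below is standalone: the two sample entries are copied from the table above, and the `grad_norm` key is an assumption that only holds on `transformers` versions that log gradient norms.

```python
# Format Trainer-style log entries as a markdown table. With a real run this
# would be `log_history = trainer.state.log_history`; here two entries are
# copied from the table above so the snippet runs on its own.
log_history = [
    {"step": 1,  "loss": 2.1426, "grad_norm": 1.333079, "learning_rate": 2.976e-4, "epoch": 0.04},
    {"step": 25, "loss": 1.1061, "grad_norm": 0.756779, "learning_rate": 2.856e-4, "epoch": 0.22},
]

print("| Training Step | Training Loss | Grad Norm | Learning Rate | Epoch |")
print("|---|---|---|---|---|")
for entry in log_history:
    if "loss" in entry:  # real log histories also contain eval/summary entries
        print(
            f"| {entry['step']} | {entry['loss']:.4f} "
            f"| {entry.get('grad_norm', float('nan')):.6f} "
            f"| {entry['learning_rate']:.3e} | {entry['epoch']:.2f} |"
        )
```

On a real run, evaluation entries can be filtered out the same way.
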
#### Final Training Summary

| Metric                   | Value                |
|--------------------------|----------------------|
| Train Runtime            | 2457.436 s (≈41 min) |
| Train Samples per Second | 0.814                |
| Train Steps per Second   | 0.203                |
| Train Loss               | 0.4267               |
| Epoch                    | 4.43                 |
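
These figures line up with the hyperparameters above, a quick consistency check worth doing when reading a training summary. The arithmetic below is illustrative, not part of the card:

```python
# Cross-check the final summary against the hyperparameters listed earlier.
runtime_s = 2457.436
steps = 500
effective_batch = 1 * 4  # per_device_train_batch_size * gradient_accumulation_steps

print(f"{steps / runtime_s:.3f} steps/s")                      # 0.203, matches the table
print(f"{steps * effective_batch / runtime_s:.3f} samples/s")  # 0.814, matches the table
# 500 steps * 4 samples/step = 2000 samples; 2000 / 4.43 epochs ~ 451 training examples
```

The implied dataset size of roughly 450 examples is an inference from these numbers, not a stated fact.
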
## Model Card Authors

[More Information Needed]