---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.15_gpt2
    results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
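Since the student retains the GPT-2 architecture, it should load with the standard transformers API. Below is a minimal usage sketch; the repository id is an assumption inferred from the model name and may differ from the actual Hub path:

```python
# Minimal sketch: load and sample from the distilled student.
# The repository id is an assumption inferred from the model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross_v2.15_gpt2"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```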

It achieves the following results on the evaluation set (see the note on perplexity after this list):

- eval_enwikippl: 99.0
- eval_frwikippl: 408.0
- eval_zhwikippl: 149.0
- eval_tinystoriesppl: 74.5
- eval_loss: 0.7768
- eval_runtime: 16.7488
- eval_samples_per_second: 59.706
- eval_steps_per_second: 7.463
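The *ppl metrics are presumably perplexities on English Wikipedia (enwiki), French Wikipedia (frwiki), Chinese Wikipedia (zhwiki), and TinyStories samples; lower is better. This card does not document exactly how Distily computes them, but the standard definition for a causal LM is the exponential of the mean per-token negative log-likelihood, sketched below with an illustrative sample and an assumed Hub repository id:

```python
# Sketch of the standard perplexity definition for a causal LM:
# ppl = exp(mean negative log-likelihood per token). This is not
# necessarily the exact procedure Distily uses for the *ppl metrics.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross_v2.15_gpt2"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

text = "The quick brown fox jumps over the lazy dog."  # illustrative
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean token cross-entropy.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")
```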

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the logits-KL objective in the first item follows the list):

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
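Per the distillation_objective above, only the logits component is active: the student is trained to match the teacher's output distribution under a KL loss with weight 1, while the hidden-state (hs) and attention (attn) components are disabled (weight 0) and no layer mapping or projection is applied. Below is a generic PyTorch sketch of such a logits-KL loss, not Distily's actual implementation:

```python
# Generic sketch of a logits-KL distillation loss (forward KL,
# KL(teacher || student)); not Distily's actual implementation.
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    # Shapes: (batch, seq_len, vocab). kl_div expects the input in
    # log space; "batchmean" sums over tokens and vocabulary and
    # divides by the batch size.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```

train_embeddings: True presumably means the student's embedding layers are updated during training rather than kept frozen.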

## Resource Usage

Peak GPU Memory: 7.4226 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 10027008.0 | 5734400.0 | 11.3280 | 16.7303 | 59.772 | 7.471 | 5865472.0 | 3637248.0 |
| 1000 | 0.0404 | 382.0 | 2016.0 | 1.6270 | 16.7162 | 59.822 | 7.478 | 298.0 | 1200.0 |
| 2000 | 0.0808 | 270.0 | 988.0 | 1.4340 | 16.7294 | 59.775 | 7.472 | 227.0 | 390.0 |
| 3000 | 0.1212 | 218.0 | 736.0 | 1.2960 | 16.686 | 59.931 | 7.491 | 185.0 | 239.0 |
| 4000 | 0.1616 | 182.0 | 708.0 | 1.1629 | 16.7199 | 59.809 | 7.476 | 144.0 | 213.0 |
| 5000 | 0.2020 | 155.0 | 620.0 | 1.0601 | 16.7539 | 59.688 | 7.461 | 117.0 | 230.0 |
| 6000 | 0.2424 | 131.0 | 496.0 | 0.9643 | 16.7329 | 59.763 | 7.47 | 104.5 | 193.0 |
| 7000 | 0.2828 | 122.0 | 512.0 | 0.9001 | 16.7515 | 59.696 | 7.462 | 91.5 | 193.0 |
| 8000 | 0.3232 | 111.5 | 428.0 | 0.8336 | 16.677 | 59.963 | 7.495 | 82.5 | 161.0 |
| 9000 | 0.3636 | 99.0 | 408.0 | 0.7768 | 16.7488 | 59.706 | 7.463 | 74.5 | 149.0 |
| 10000 | 0.4040 | 89.5 | 386.0 | 0.7219 | 16.6219 | 60.162 | 7.52 | 71.5 | 140.0 |
| 11000 | 0.4444 | 82.5 | 332.0 | 0.6595 | 16.6753 | 59.969 | 7.496 | 67.0 | 148.0 |
| 12000 | 0.4848 | 78.0 | 300.0 | 0.6302 | 16.6631 | 60.013 | 7.502 | 60.75 | 131.0 |
| 13000 | 0.5253 | 73.5 | 292.0 | 0.6019 | 16.7166 | 59.821 | 7.478 | 59.5 | 117.0 |
| 14000 | 0.5657 | 75.0 | 284.0 | 0.5861 | 16.7002 | 59.88 | 7.485 | 59.5 | 137.0 |
| 15000 | 0.6061 | 71.5 | 252.0 | 0.5722 | 16.6732 | 59.976 | 7.497 | 55.25 | 130.0 |
| 16000 | 0.6465 | 70.0 | 250.0 | 0.5545 | 16.6934 | 59.904 | 7.488 | 57.75 | 104.5 |
| 17000 | 0.6869 | 70.0 | 272.0 | 0.5426 | 16.6888 | 59.92 | 7.49 | 55.5 | 130.0 |
| 18000 | 0.7273 | 70.0 | 248.0 | 0.5380 | 16.6762 | 59.966 | 7.496 | 53.75 | 124.0 |
| 19000 | 0.7677 | 68.5 | 227.0 | 0.5270 | 16.6682 | 59.994 | 7.499 | 53.5 | 96.5 |
| 20000 | 0.8081 | 65.5 | 219.0 | 0.5260 | 16.6778 | 59.96 | 7.495 | 52.0 | 129.0 |
| 21000 | 0.8485 | 68.0 | 228.0 | 0.5154 | 16.7388 | 59.741 | 7.468 | 52.0 | 140.0 |
| 22000 | 0.8889 | 68.5 | 246.0 | 0.5128 | 16.6637 | 60.011 | 7.501 | 51.75 | 216.0 |
| 23000 | 0.9293 | 64.5 | 245.0 | 0.5029 | 16.7201 | 59.808 | 7.476 | 52.5 | 146.0 |
| 24000 | 0.9697 | 66.5 | 230.0 | 0.5067 | 16.7059 | 59.859 | 7.482 | 51.25 | 168.0 |
| 24750 | 1.0 | 65.5 | 228.0 | 0.5042 | 16.685 | 59.934 | 7.492 | 51.25 | 100.5 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- PyTorch 2.3.0
- Datasets 2.21.0