---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model is distilled from the teacher model [gpt2](https://huggingface.co./gpt2) using an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 84.5
- eval_frwikippl: 356.0
- eval_zhwikippl: 135.0
- eval_tinystoriesppl: 72.0
- eval_loss: 0.6795
- eval_runtime: 16.7299
- eval_samples_per_second: 59.773
- eval_steps_per_second: 7.472

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
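
The objective above applies only a logits-level KL loss with weight 1; the hidden-state (hs) and attention (attn) components are disabled (weight 0). As a minimal PyTorch sketch of what such a logits-only KL objective computes (an illustration under assumed names, not Distily's actual implementation):

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions.

    A sketch of a logits-only distillation loss; not Distily's code.
    """
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab)
    s = student_logits.reshape(-1, student_logits.size(-1))
    t = teacher_logits.reshape(-1, teacher_logits.size(-1))
    # F.kl_div expects log-probabilities for the input; with log_target=True
    # the target is also given as log-probabilities. "batchmean" averages
    # over the flattened token positions.
    return F.kl_div(
        F.log_softmax(s, dim=-1),  # student log-probabilities
        F.log_softmax(t, dim=-1),  # teacher log-probabilities
        log_target=True,
        reduction="batchmean",
    )

# Hypothetical training step:
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = 1.0 * logits_kl_loss(student_logits, teacher_logits)  # weight=1
```
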
### Resource Usage

Peak GPU Memory: 7.4226 GB

### Eval-Phase Metrics

Perplexity (ppl) columns are measured on English Wikipedia (enwiki), French Wikipedia (frwiki), Chinese Wikipedia (zhwiki), and TinyStories; runtime is in seconds.

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 3126736191488.0 | 129742372077568.0 | 20.7540 | 16.7407 | 59.735 | 7.467 | 6677331968.0 | 80264348827648.0 |
| 1000 | 0.0404 | 320.0 | 1504.0 | 1.5025 | 16.7846 | 59.579 | 7.447 | 245.0 | 280.0 |
| 2000 | 0.0808 | 220.0 | 800.0 | 1.3040 | 16.7756 | 59.61 | 7.451 | 189.0 | 201.0 |
| 3000 | 0.1212 | 180.0 | 648.0 | 1.1450 | 16.7863 | 59.572 | 7.447 | 153.0 | 149.0 |
| 4000 | 0.1616 | 148.0 | 552.0 | 1.0301 | 16.7242 | 59.794 | 7.474 | 121.5 | 153.0 |
| 5000 | 0.2020 | 129.0 | 452.0 | 0.9348 | 16.7817 | 59.589 | 7.449 | 105.0 | 176.0 |
| 6000 | 0.2424 | 115.5 | 442.0 | 0.8587 | 16.8358 | 59.397 | 7.425 | 86.0 | 139.0 |
| 7000 | 0.2828 | 103.0 | 432.0 | 0.8002 | 16.7689 | 59.634 | 7.454 | 78.5 | 139.0 |
| 8000 | 0.3232 | 96.5 | 418.0 | 0.7424 | 16.7778 | 59.602 | 7.45 | 73.5 | 126.0 |
| 9000 | 0.3636 | 84.5 | 356.0 | 0.6795 | 16.7299 | 59.773 | 7.472 | 72.0 | 135.0 |
| 10000 | 0.4040 | 81.5 | 304.0 | 0.6324 | 16.7186 | 59.813 | 7.477 | 66.0 | 125.5 |
| 11000 | 0.4444 | 77.5 | 282.0 | 0.5972 | 16.777 | 59.605 | 7.451 | 59.25 | 121.5 |
| 12000 | 0.4848 | 72.5 | 288.0 | 0.5723 | 16.7347 | 59.756 | 7.47 | 56.75 | 118.0 |
| 13000 | 0.5253 | 69.5 | 256.0 | 0.5577 | 16.7525 | 59.693 | 7.462 | 55.5 | 141.0 |
| 14000 | 0.5657 | 68.5 | 237.0 | 0.5389 | 16.7317 | 59.767 | 7.471 | 54.75 | 286.0 |
| 15000 | 0.6061 | 67.5 | 252.0 | 0.5187 | 16.7326 | 59.764 | 7.47 | 52.25 | 98.5 |
| 16000 | 0.6465 | 69.0 | 235.0 | 0.5174 | 16.8095 | 59.49 | 7.436 | 54.75 | 125.5 |
| 17000 | 0.6869 | 67.0 | 231.0 | 0.5048 | 16.7326 | 59.764 | 7.47 | 50.5 | 116.0 |
| 18000 | 0.7273 | 66.0 | 225.0 | 0.4909 | 16.7575 | 59.675 | 7.459 | 49.75 | 132.0 |
| 19000 | 0.7677 | 66.5 | 247.0 | 0.4894 | 16.8313 | 59.413 | 7.427 | 49.75 | 112.0 |
| 20000 | 0.8081 | 66.5 | 233.0 | 0.4870 | 16.7365 | 59.75 | 7.469 | 51.5 | 103.5 |
| 21000 | 0.8485 | 65.0 | 221.0 | 0.4831 | 16.703 | 59.869 | 7.484 | 50.75 | 181.0 |
| 22000 | 0.8889 | 65.5 | 199.0 | 0.4740 | 16.7629 | 59.656 | 7.457 | 49.5 | 95.5 |
| 23000 | 0.9293 | 67.0 | 223.0 | 0.4752 | 16.7201 | 59.808 | 7.476 | 46.5 | 174.0 |
| 24000 | 0.9697 | 65.0 | 207.0 | 0.4700 | 16.8026 | 59.515 | 7.439 | 46.75 | 98.5 |
| 24750 | 1.0 | 67.0 | 207.0 | 0.4672 | 16.7876 | 59.568 | 7.446 | 47.0 | 185.0 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- PyTorch 2.3.0
- Datasets 2.21.0
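
## Usage

Since the student keeps the GPT-2 architecture (base_model: gpt2), it can be loaded with the standard 🤗 Transformers API. A minimal sketch, assuming the checkpoint is hosted on the Hugging Face Hub and that the student reuses the GPT-2 tokenizer; the repo path below is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at wherever the distilled checkpoint is stored.
model = AutoModelForCausalLM.from_pretrained("path/to/distily_bench_obj_cross_v2.15_gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed: student reuses the GPT-2 tokenizer

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```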