metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.15_gpt2
    results: []

distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is not specified.

The Distily library was used for this distillation.

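It achieves the following results on the evaluation set. These values match the step 9000 row (epoch 0.3636) of the eval-phase metrics table, not the final training step: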

eval_enwikippl: 84.5
eval_frwikippl: 356.0
eval_zhwikippl: 135.0
eval_tinystoriesppl: 72.0
eval_loss: 0.6795
eval_runtime: 16.7299
eval_samples_per_second: 59.773
eval_steps_per_second: 7.472
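
As a rough consistency check, the throughput figures above can be multiplied by the runtime to estimate how many samples and batches one evaluation pass covers. This is only a sketch: the numbers are copied from this card, and the card does not state which evaluation split(s) the runtime covers.

```python
# Values copied from the evaluation results and hyperparameters in this card.
eval_runtime = 16.7299             # seconds
eval_samples_per_second = 59.773
eval_steps_per_second = 7.472
eval_batch_size = 8

# Throughput multiplied by runtime recovers the approximate counts.
approx_samples = eval_runtime * eval_samples_per_second
approx_steps = eval_runtime * eval_steps_per_second

print(round(approx_samples))                  # about 1000 samples
print(round(approx_steps))                    # about 125 steps
print(round(approx_steps) * eval_batch_size)  # 125 * 8 = 1000
```

The two estimates agree: roughly 125 evaluation steps at a batch size of 8 account for roughly 1000 samples.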

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.2
num_epochs: 1.0
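
The distillation_objective above gives weight 1 to the logits component with loss_fn=kl and weight 0 to the hidden-state and attention components, so only the logits term appears to contribute to the loss. The following is a minimal pure-Python sketch of a KL-divergence loss between teacher and student logits at a single token position. The direction KL(teacher || student), the absence of a temperature, and the function names are assumptions made for illustration; Distily's actual implementation may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    # Sum over the vocabulary of p_teacher * (log p_teacher - log p_student).
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# Toy three-token vocabulary; these logits are made up for the example.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]

loss_same = kl_logits_loss(teacher, teacher)  # identical distributions
loss_diff = kl_logits_loss(teacher, student)  # mismatched distributions
print(loss_same, loss_diff)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is the quantity the eval loss column would track under these assumptions.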

Resource Usage

Peak GPU Memory: 7.4226 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	3126736191488.0	129742372077568.0	20.7540	16.7407	59.735	7.467	6677331968.0	80264348827648.0
1000	0.0404	320.0	1504.0	1.5025	16.7846	59.579	7.447	245.0	280.0
2000	0.0808	220.0	800.0	1.3040	16.7756	59.61	7.451	189.0	201.0
3000	0.1212	180.0	648.0	1.1450	16.7863	59.572	7.447	153.0	149.0
4000	0.1616	148.0	552.0	1.0301	16.7242	59.794	7.474	121.5	153.0
5000	0.2020	129.0	452.0	0.9348	16.7817	59.589	7.449	105.0	176.0
6000	0.2424	115.5	442.0	0.8587	16.8358	59.397	7.425	86.0	139.0
7000	0.2828	103.0	432.0	0.8002	16.7689	59.634	7.454	78.5	139.0
8000	0.3232	96.5	418.0	0.7424	16.7778	59.602	7.45	73.5	126.0
9000	0.3636	84.5	356.0	0.6795	16.7299	59.773	7.472	72.0	135.0
10000	0.4040	81.5	304.0	0.6324	16.7186	59.813	7.477	66.0	125.5
11000	0.4444	77.5	282.0	0.5972	16.777	59.605	7.451	59.25	121.5
12000	0.4848	72.5	288.0	0.5723	16.7347	59.756	7.47	56.75	118.0
13000	0.5253	69.5	256.0	0.5577	16.7525	59.693	7.462	55.5	141.0
14000	0.5657	68.5	237.0	0.5389	16.7317	59.767	7.471	54.75	286.0
15000	0.6061	67.5	252.0	0.5187	16.7326	59.764	7.47	52.25	98.5
16000	0.6465	69.0	235.0	0.5174	16.8095	59.49	7.436	54.75	125.5
17000	0.6869	67.0	231.0	0.5048	16.7326	59.764	7.47	50.5	116.0
18000	0.7273	66.0	225.0	0.4909	16.7575	59.675	7.459	49.75	132.0
19000	0.7677	66.5	247.0	0.4894	16.8313	59.413	7.427	49.75	112.0
20000	0.8081	66.5	233.0	0.4870	16.7365	59.75	7.469	51.5	103.5
21000	0.8485	65.0	221.0	0.4831	16.703	59.869	7.484	50.75	181.0
22000	0.8889	65.5	199.0	0.4740	16.7629	59.656	7.457	49.5	95.5
23000	0.9293	67.0	223.0	0.4752	16.7201	59.808	7.476	46.5	174.0
24000	0.9697	65.0	207.0	0.4700	16.8026	59.515	7.439	46.75	98.5
24750	1.0	67.0	207.0	0.4672	16.7876	59.568	7.446	47.0	185.0
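
To put the final row in context, the student's perplexities can be expressed as multiples of the teacher's. The sketch below uses only the "teacher eval" row and the step 24750 row from the table above; the dictionary names are illustrative.

```python
# Perplexities copied from the "teacher eval" row and the step 24750 row.
teacher = {"enwikippl": 43.75, "frwikippl": 61.75,
           "tinystoriesppl": 11.8125, "zhwikippl": 19.125}
final = {"enwikippl": 67.0, "frwikippl": 207.0,
         "tinystoriesppl": 47.0, "zhwikippl": 185.0}

# Ratio above 1 means the student is worse (higher perplexity) than the teacher.
ratios = {name: final[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

The student is closest to the teacher on enwikippl and furthest on zhwikippl. The zhwikippl column fluctuates widely between checkpoints (98.5 at step 24000 versus 185.0 at step 24750), so its final-step ratio should be read with caution.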

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0