metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.15_gpt2
    results: []

distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

eval_enwikippl: 84.0
eval_frwikippl: 342.0
eval_zhwikippl: 217.0
eval_tinystoriesppl: 69.5
eval_loss: 0.6877
eval_runtime: 16.9969
eval_samples_per_second: 58.834
eval_steps_per_second: 7.354
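
As a quick sanity check on the throughput figures, the runtime can be multiplied by the reported rates to recover the implied size of the evaluation run. This is only a sketch of the arithmetic; the variable names are illustrative, and the evaluation-set size is inferred here, not stated anywhere in this card.

```python
# Values copied from the evaluation results above.
eval_runtime = 16.9969             # seconds
eval_samples_per_second = 58.834
eval_steps_per_second = 7.354

# Rate multiplied by runtime gives the implied number of samples and steps.
implied_samples = eval_runtime * eval_samples_per_second
implied_steps = eval_runtime * eval_steps_per_second

print(round(implied_samples))                  # 1000
print(round(implied_steps))                    # 125
print(round(implied_samples / implied_steps))  # 8
```

The implied ratio of about 8 samples per step agrees with the eval_batch_size of 8 listed under the training hyperparameters.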

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.2
num_epochs: 1.0
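
The distillation_objective above combines a KL term on the logits (weight 1) with an MSE term on the last hidden state (weight 1.0), and gives the attention term a weight of 0. The following is a minimal pure-Python sketch of that weighted sum on plain lists, not Distily's actual implementation. The function names, the KL direction (teacher to student), and the mean reduction in the MSE are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student); the direction is an assumption.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_logits, student_logits,
                      teacher_last_hs, student_last_hs,
                      logits_weight=1.0, hs_weight=1.0):
    # Weighted sum of the two active components. The attention
    # component has weight 0 in this configuration, so it is omitted.
    logits_term = kl_divergence(teacher_logits, student_logits)
    hs_term = mse(teacher_last_hs, student_last_hs)
    return logits_weight * logits_term + hs_weight * hs_term

# Toy inputs: one token position, a 3-word vocabulary, a 2-dim hidden state.
loss_same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1],
                              [0.5, -0.5], [0.5, -0.5])
loss_diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0],
                              [0.5, -0.5], [0.0, 0.0])
print(loss_same)  # 0.0
print(loss_diff > 0)  # True
```

In a real training loop these terms would be computed on batched tensors across all token positions; the sketch only shows how the two weighted components add up.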

Resource Usage

Peak GPU Memory: 7.7252 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	2473901162496.0	170424302305280.0	20.7680	17.0409	58.682	7.335	4060086272.0	71468255805440.0
1000	0.0404	334.0	1464.0	1.5419	17.0178	58.762	7.345	243.0	596.0
2000	0.0808	232.0	756.0	1.3235	16.9755	58.909	7.364	189.0	250.0
3000	0.1212	180.0	628.0	1.1620	16.9923	58.85	7.356	149.0	171.0
4000	0.1616	150.0	576.0	1.0434	16.9803	58.892	7.361	121.5	172.0
5000	0.2020	130.0	504.0	0.9520	17.0128	58.779	7.347	100.5	144.0
6000	0.2424	113.5	420.0	0.8702	17.0074	58.798	7.35	91.0	137.0
7000	0.2828	106.0	408.0	0.8100	16.9821	58.885	7.361	80.5	160.0
8000	0.3232	96.5	396.0	0.7421	16.9749	58.911	7.364	70.5	127.0
9000	0.3636	84.0	342.0	0.6877	16.9969	58.834	7.354	69.5	217.0
10000	0.4040	78.0	300.0	0.6467	16.9846	58.877	7.36	65.0	139.0
11000	0.4444	77.0	278.0	0.5957	16.9903	58.857	7.357	60.0	127.5
12000	0.4848	75.0	272.0	0.5789	16.9858	58.873	7.359	56.5	140.0
13000	0.5253	71.5	266.0	0.5525	16.9418	59.026	7.378	56.5	116.0
14000	0.5657	71.0	252.0	0.5416	17.088	58.521	7.315	53.75	132.0
15000	0.6061	68.0	221.0	0.5283	16.9524	58.989	7.374	50.25	112.5
16000	0.6465	70.0	244.0	0.5200	17.0495	58.653	7.332	52.5	109.5
17000	0.6869	67.0	225.0	0.5097	17.0223	58.747	7.343	51.5	109.0
18000	0.7273	71.0	239.0	0.5016	17.0519	58.644	7.331	49.5	150.0
19000	0.7677	68.0	212.0	0.4887	17.0831	58.537	7.317	51.25	98.0
20000	0.8081	65.0	211.0	0.4865	17.0098	58.789	7.349	49.0	101.5
21000	0.8485	64.5	217.0	0.4791	17.0253	58.736	7.342	47.5	142.0
22000	0.8889	66.5	230.0	0.4798	16.9954	58.839	7.355	48.5	147.0
23000	0.9293	62.5	212.0	0.4675	16.9835	58.881	7.36	45.5	134.0
24000	0.9697	63.5	220.0	0.4712	16.9973	58.833	7.354	47.0	138.0
24750	1.0	63.75	247.0	0.4679	17.0597	58.618	7.327	45.75	205.0
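
To put the final student numbers in context, the perplexities at step 24750 can be divided by the teacher's perplexities from the first row of the table. The sketch below just performs that division; the dictionary names are illustrative.

```python
# Perplexities copied from the table above:
# the "teacher eval" row and the final student row (step 24750).
teacher = {"enwikippl": 43.75, "frwikippl": 61.75,
           "tinystoriesppl": 11.8125, "zhwikippl": 19.125}
student = {"enwikippl": 63.75, "frwikippl": 247.0,
           "tinystoriesppl": 45.75, "zhwikippl": 205.0}

# Ratio of student to teacher perplexity; 1.0 would mean parity.
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x teacher perplexity")
# enwikippl: 1.46x teacher perplexity
# tinystoriesppl: 3.87x teacher perplexity
# frwikippl: 4.00x teacher perplexity
# zhwikippl: 10.72x teacher perplexity
```

By this measure the student is closest to the teacher on English Wikipedia and furthest on Chinese Wikipedia at the end of training.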

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0