---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

- eval_enwikippl: 560.0
- eval_frwikippl: 644.0
- eval_zhwikippl: 488.0
- eval_tinystoriesppl: 284.0
- eval_loss: 0.6086
- eval_runtime: 16.7587 (s)
- eval_samples_per_second: 59.67
- eval_steps_per_second: 7.459
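The `*ppl` metrics are per-dataset perplexities on held-out text. As a general reminder of what those numbers mean (this is an illustrative helper, not Distily's evaluation code), perplexity is the exponential of the average per-token negative log-likelihood:

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probabilities the model
    assigned to each ground-truth token (hypothetical helper, not a
    Distily API).
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity ~4.
print(perplexity([math.log(0.25)] * 8))
```

Lower is better: a perplexity of 560 on enwiki means the student is, on average, about as uncertain as a uniform choice over 560 tokens at each position.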

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
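The objective above puts all of its weight on a KL-divergence loss between teacher and student logits (the hidden-state and attention components have weight 0). A minimal, framework-agnostic sketch of that logits component, reduced to a single vocabulary distribution at one position (Distily's actual implementation operates on batched tensors and may differ in reduction and temperature handling):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary distribution at a
    single position -- a sketch of `loss_fn=kl` in the
    `logits_loss_component` above, not Distily's exact code.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero exactly when the student reproduces the teacher's distribution, and grows as the student places probability mass where the teacher does not.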

## Resource Usage

Peak GPU Memory: 7.4226 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 1408749273088.0 | 96207267430400.0 | 20.4380 | 16.6447 | 60.079 | 7.51 | 7482638336.0 | 43430709297152.0 |
| 1000 | 0.0404 | 1408.0 | 1432.0 | 0.8546 | 16.7128 | 59.835 | 7.479 | 788.0 | 1056.0 |
| 2000 | 0.0808 | 988.0 | 928.0 | 0.7631 | 16.6827 | 59.942 | 7.493 | 520.0 | 302.0 |
| 3000 | 0.1212 | 836.0 | 760.0 | 0.7155 | 16.633 | 60.121 | 7.515 | 402.0 | 196.0 |
| 4000 | 0.1616 | 732.0 | 676.0 | 0.6800 | 16.6836 | 59.939 | 7.492 | 378.0 | 157.0 |
| 5000 | 0.2020 | 676.0 | 668.0 | 0.6574 | 16.6514 | 60.055 | 7.507 | 322.0 | 227.0 |
| 6000 | 0.2424 | 648.0 | 732.0 | 0.6383 | 16.6833 | 59.94 | 7.493 | 286.0 | 190.0 |
| 7000 | 0.2828 | 612.0 | 632.0 | 0.6373 | 16.8106 | 59.486 | 7.436 | 286.0 | 169.0 |
| 8000 | 0.3232 | 588.0 | 704.0 | 0.6243 | 16.6588 | 60.028 | 7.504 | 266.0 | 596.0 |
| 9000 | 0.3636 | 560.0 | 644.0 | 0.6086 | 16.7587 | 59.67 | 7.459 | 284.0 | 488.0 |
| 10000 | 0.4040 | 532.0 | 564.0 | 0.5994 | 16.6696 | 59.989 | 7.499 | 256.0 | 142.0 |
| 11000 | 0.4444 | 544.0 | 628.0 | 0.5916 | 16.7004 | 59.879 | 7.485 | 252.0 | 153.0 |
| 12000 | 0.4848 | 540.0 | 612.0 | 0.5828 | 16.7602 | 59.665 | 7.458 | 252.0 | 568.0 |
| 13000 | 0.5253 | 528.0 | 612.0 | 0.5735 | 16.6596 | 60.025 | 7.503 | 260.0 | 160.0 |
| 14000 | 0.5657 | 528.0 | 576.0 | 0.5628 | 16.7207 | 59.806 | 7.476 | 246.0 | 250.0 |
| 15000 | 0.6061 | 478.0 | 524.0 | 0.5511 | 16.736 | 59.752 | 7.469 | 232.0 | 170.0 |
| 16000 | 0.6465 | 442.0 | 552.0 | 0.5270 | 16.7225 | 59.8 | 7.475 | 228.0 | 214.0 |
| 17000 | 0.6869 | 420.0 | 524.0 | 0.4692 | 16.6506 | 60.058 | 7.507 | 212.0 | 174.0 |
| 18000 | 0.7273 | 384.0 | 478.0 | 0.4115 | 16.7225 | 59.8 | 7.475 | 208.0 | 144.0 |
| 19000 | 0.7677 | 362.0 | 400.0 | 0.3610 | 16.6691 | 59.991 | 7.499 | 195.0 | 128.0 |
| 20000 | 0.8081 | 344.0 | 346.0 | 0.3370 | 16.6695 | 59.99 | 7.499 | 184.0 | 107.5 |
| 21000 | 0.8485 | 306.0 | 302.0 | 0.3061 | 16.7054 | 59.861 | 7.483 | 161.0 | 110.5 |
| 22000 | 0.8889 | 300.0 | 318.0 | 0.2974 | 16.6709 | 59.985 | 7.498 | 160.0 | 84.0 |
| 23000 | 0.9293 | 290.0 | 298.0 | 0.2890 | 16.7049 | 59.863 | 7.483 | 162.0 | 103.0 |
| 24000 | 0.9697 | 300.0 | 290.0 | 0.2970 | 16.6771 | 59.963 | 7.495 | 164.0 | 85.5 |
| 24750 | 1.0 | 280.0 | 290.0 | 0.2782 | 16.74 | 59.737 | 7.467 | 162.0 | 91.5 |
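For context, the final checkpoint (step 24750) can be compared against the teacher row of the table. A quick sketch computing the student/teacher perplexity ratio directly from the table values:

```python
# Perplexities copied from the table: teacher eval row and the final
# student checkpoint (step 24750).
teacher = {"enwikippl": 43.75, "frwikippl": 61.75,
           "tinystoriesppl": 11.8125, "zhwikippl": 19.125}
final_student = {"enwikippl": 280.0, "frwikippl": 290.0,
                 "tinystoriesppl": 162.0, "zhwikippl": 91.5}

# Ratio > 1 means the student's perplexity is still above the teacher's.
ratios = {k: final_student[k] / teacher[k] for k in teacher}
print(ratios["enwikippl"])  # → 6.4
```

So after one epoch the student remains roughly 5x-14x higher in perplexity than the teacher, depending on the dataset, though it has closed an enormous gap from the random-initialization row at step 0.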

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0