
distily_bench_obj_cross_v2.10

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.
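As a minimal usage sketch, the student can be loaded through the standard transformers causal-LM API (the repo id lapp0/distily_bench_obj_cross_v2.10 is taken from this card; verify it before use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.10"  # repo id from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation in the TinyStories style.
inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```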

It achieves the following results on the evaluation set:

  • eval_enwikippl: 107.6398
  • eval_frwikippl: 10204.3643
  • eval_zhwikippl: 49954.8242
  • eval_tinystoriesppl: 6.6903
  • eval_loss: 0.7036
  • eval_runtime: 13.0602
  • eval_samples_per_second: 76.568
  • eval_steps_per_second: 9.571
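The `*ppl` metrics are perplexities, presumably the exponential of the mean next-token cross-entropy on each evaluation corpus. A minimal sketch of that computation with a transformers causal LM (illustrative only; the exact evaluation harness lives in Distily and may differ):

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """exp of the mean next-token cross-entropy over `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # (shifted) next-token cross-entropy as `out.loss`.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()
```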

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 1e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
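Per the `distillation_objective` above, only the logits term is active (weight 1, KL loss); the hidden-state and attention terms have weight 0. A minimal sketch of such a logits-only KL distillation loss in PyTorch (illustrative, not Distily's actual implementation):

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per token."""
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # log_target=True because the target is given as log-probabilities.
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean")
```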

Resource Usage

Peak GPU Memory: 6.6064 GB
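A figure like this can be read from PyTorch's CUDA allocator; a minimal sketch follows (the measurement method actually used for this card is not specified, and GB vs. GiB conventions may differ):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run training ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9  # bytes -> GB
print(f"Peak GPU memory: {peak_gb:.4f} GB")
```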

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 50480.5703 | 85684.4844 | 6.8305 | 13.0304 | 76.744 | 9.593 | 33932.0586 | 94692.1562 |
| 5000 | 0.0505 | 110.8554 | 10584.2598 | 0.7523 | 13.0416 | 76.677 | 9.585 | 6.7911 | 42034.9414 |
| 10000 | 0.1010 | 104.0690 | 10210.1172 | 0.7242 | 13.0341 | 76.722 | 9.59 | 6.4174 | 44683.2305 |
| 15000 | 0.1515 | 113.6466 | 10400.9941 | 0.7156 | 13.0171 | 76.822 | 9.603 | 7.2840 | 46906.4258 |
| 20000 | 0.2020 | 111.4970 | 9877.6748 | 0.7117 | 13.0184 | 76.814 | 9.602 | 7.1889 | 47931.1602 |
| 25000 | 0.2525 | 107.3317 | 10121.3330 | 0.7051 | 13.088 | 76.406 | 9.551 | 6.6947 | 49516.9375 |
| 30000 | 0.3030 | 107.4814 | 10147.0312 | 0.7042 | 13.0664 | 76.532 | 9.567 | 6.6925 | 49728.7578 |
| 35000 | 0.3535 | 107.5147 | 10109.9404 | 0.7041 | 13.0324 | 76.732 | 9.591 | 6.6794 | 49279.6914 |
| 40000 | 0.4040 | 107.5064 | 10121.3330 | 0.7041 | 13.1335 | 76.141 | 9.518 | 6.6994 | 49835.0078 |
| 45000 | 0.4545 | 107.3816 | 10129.8984 | 0.7039 | 13.1075 | 76.292 | 9.537 | 6.6972 | 49464.1211 |
| 50000 | 0.5051 | 107.5231 | 10129.8984 | 0.7040 | 13.0137 | 76.842 | 9.605 | 6.7041 | 49808.4492 |
| 55000 | 0.5556 | 107.7482 | 10135.5996 | 0.7040 | 13.0084 | 76.874 | 9.609 | 6.7052 | 49464.1211 |
| 60000 | 0.6061 | 107.6064 | 10204.3643 | 0.7040 | 13.0291 | 76.751 | 9.594 | 6.6991 | 49914.8711 |
| 65000 | 0.6566 | 107.6981 | 10204.3643 | 0.7037 | 13.0479 | 76.641 | 9.58 | 6.6958 | 49543.3398 |
| 70000 | 0.7071 | 107.8484 | 10204.3643 | 0.7036 | 13.0612 | 76.563 | 9.57 | 6.6953 | 49848.3164 |
| 75000 | 0.7576 | 107.5897 | 10204.3643 | 0.7036 | 13.1821 | 75.86 | 9.483 | 6.6895 | 49888.2188 |
| 80000 | 0.8081 | 107.6398 | 10204.3643 | 0.7037 | 13.1572 | 76.004 | 9.5 | 6.6900 | 49835.0078 |
| 85000 | 0.8586 | 107.7148 | 10204.3643 | 0.7037 | 12.9936 | 76.961 | 9.62 | 6.6928 | 49928.1523 |
| 90000 | 0.9091 | 107.6398 | 10204.3643 | 0.7035 | 13.0225 | 76.79 | 9.599 | 6.6919 | 49954.8242 |
| 95000 | 0.9596 | 107.6398 | 10204.3643 | 0.7036 | 13.0696 | 76.514 | 9.564 | 6.6914 | 49954.8242 |
| 99000 | 1.0 | 107.6398 | 10204.3643 | 0.7036 | 13.0602 | 76.568 | 9.571 | 6.6903 | 49954.8242 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0