
distily_experiments_loss_cakld

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct; the training dataset is unspecified.

The Distily library was used for this distillation.
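Since the student is a standard causal language model checkpoint, it can be loaded with Hugging Face Transformers. The snippet below is a minimal usage sketch, assuming the checkpoint at lapp0/distily_experiments_loss_cakld is compatible with AutoModelForCausalLM; the prompt and generation settings are illustrative only.

```python
# Minimal usage sketch (assumption: standard causal-LM checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_cakld"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation compresses a teacher model by", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```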

It achieves the following results on the evaluation set:

  • eval_enwikippl: 12004.6143
  • eval_frwikippl: 41571.4961
  • eval_zhwikippl: 304727.5625
  • eval_loss: 5.0626
  • eval_runtime: 154.2547
  • eval_samples_per_second: 6.483
  • eval_steps_per_second: 3.241
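The eval_enwikippl, eval_frwikippl, and eval_zhwikippl values are per-dataset perplexities (the names indicate English, French, and Chinese Wikipedia evaluation text), and eval_runtime follows the Transformers Trainer convention of seconds. For reference, the sketch below shows the conventional way such a perplexity is computed from token-level cross-entropy; it illustrates the standard definition, not necessarily Distily's exact evaluation loop.

```python
# Sketch: perplexity as exp(mean token-level cross-entropy) over an evaluation text.
# Illustrates the standard definition; the Distily evaluation code may differ in detail.
import math
import torch

def perplexity(model, tokenizer, text, max_length=1024):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```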

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: cakld
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
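The total_train_batch_size of 16 is simply the per-device train_batch_size (2) multiplied by gradient_accumulation_steps (8). The loss_fn: cakld entry names the distillation objective; one common formulation of a confidence-adaptive KL-divergence loss blends forward and reverse KL between the teacher and student token distributions, as in the hedged sketch below. This illustrates that general technique under assumed semantics for the blending coefficient, not a verified copy of Distily's cakld implementation.

```python
# Hedged sketch of a confidence-adaptive KL distillation loss on logits.
# `gamma` (e.g. the teacher's average top-token confidence) blends the two KL directions;
# this shows the general idea, not necessarily Distily's exact `cakld` code.
import torch
import torch.nn.functional as F

def cakld_like_loss(student_logits, teacher_logits, gamma):
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    forward_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean()  # KL(teacher || student)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()  # KL(student || teacher)
    return gamma * forward_kl + (1.0 - gamma) * reverse_kl
```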

Resource Usage

Peak GPU Memory: 16.3131 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0728 | 11.6511 | | | | | 21.6338 |
| 0 | 0 | 180421.8594 | 182559.25 | 426.3911 | 153.7612 | 6.504 | 3.252 | 181425.2031 |
| 500 | 0.0808 | 25914.9355 | 58400.8789 | 9.3101 | 153.5119 | 6.514 | 3.257 | 246441.4688 |
| 1000 | 0.1616 | 18034.6934 | 51549.3594 | 7.7199 | 153.5062 | 6.514 | 3.257 | 268579.2188 |
| 1500 | 0.2424 | 16219.1631 | 49107.0469 | 6.9828 | 153.332 | 6.522 | 3.261 | 295905.5312 |
| 2000 | 0.3232 | 14691.3838 | 42907.7266 | 6.4286 | 153.9179 | 6.497 | 3.248 | 327841.9375 |
| 2500 | 0.4040 | 16166.2109 | 47578.2539 | 6.0867 | 153.7651 | 6.503 | 3.252 | 314603.125 |
| 3000 | 0.4848 | 15222.2021 | 45014.1914 | 5.7882 | 153.7316 | 6.505 | 3.252 | 333285.875 |
| 3500 | 0.5657 | 14355.2061 | 44631.9570 | 5.5960 | 154.0532 | 6.491 | 3.246 | 350936.6875 |
| 4000 | 0.6465 | 13183.8857 | 43155.7617 | 5.4694 | 154.1821 | 6.486 | 3.243 | 353883.125 |
| 4500 | 0.7273 | 12622.3330 | 41256.7539 | 5.5192 | 154.1834 | 6.486 | 3.243 | 351506.7812 |
| 5000 | 0.8081 | 12319.9580 | 40217.3828 | 5.4403 | 154.6126 | 6.468 | 3.234 | 345530.5 |
| 5500 | 0.8889 | 12254.1387 | 42295.5117 | 5.1549 | 154.6614 | 6.466 | 3.233 | 350950.0625 |
| 6000 | 0.9697 | 11398.9785 | 39517.2930 | 5.0769 | 154.199 | 6.485 | 3.243 | 359435.0625 |
| 6187 | 0.9999 | 12004.6143 | 41571.4961 | 5.0626 | 154.2547 | 6.483 | 3.241 | 304727.5625 |

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • Pytorch 2.3.0
  • Datasets 2.20.0