
distily_experiments_loss_mse

This student model is distilled from the teacher model Qwen/Qwen2-0.5B-Instruct on an unspecified dataset.

The Distily library was used for this distillation.
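The checkpoint loads like any other causal LM with Transformers. A minimal usage sketch (the prompt and generation settings are illustrative, not part of this card):

```python
# Minimal sketch: load the distilled student with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_mse"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative prompt; any causal-LM generation call works.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```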

It achieves the following results on the evaluation set (a sketch of how the perplexity metrics can be reproduced follows this list):

  • eval_enwikippl: 42178.9805
  • eval_frwikippl: 45721.7109
  • eval_zhwikippl: 198318.0
  • eval_loss: 0.0786
  • eval_runtime: 81.5551
  • eval_samples_per_second: 12.262
  • eval_steps_per_second: 3.065
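The *ppl metrics are perplexities on English, French, and Chinese Wikipedia samples. A hedged sketch of how such a number can be computed; the exact corpus slices, windowing, and whether Distily reports token- or word-level perplexity are not documented here:

```python
# Sketch: perplexity as exponentiated mean cross-entropy over a text sample.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_mse"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "An English Wikipedia passage ..."  # placeholder evaluation text
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return mean cross-entropy.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.4f}")
```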

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: mse (see the loss sketch after this list)
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
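The distillation_strategy and loss_fn settings above amount to regressing the student's logits and hidden activations onto the teacher's with mean squared error. A minimal sketch of that objective, not Distily's actual implementation (the activation_weight and the one-to-one layer pairing are assumptions; shapes match here because student and teacher share the Qwen2-0.5B architecture):

```python
# Sketch of a logits + activations MSE distillation objective.
# NOT Distily's actual code; activation_weight and the layer pairing
# are assumptions for illustration.
import torch
import torch.nn.functional as F


def distillation_loss(student_out, teacher_out, activation_weight=1.0):
    # MSE between student and teacher logits.
    loss = F.mse_loss(student_out.logits, teacher_out.logits)
    # MSE between corresponding hidden states; requires both forward
    # passes to be run with output_hidden_states=True.
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        loss = loss + activation_weight * F.mse_loss(s_h, t_h)
    return loss


# Usage, with `student`, `teacher`, and `batch` defined elsewhere:
#   with torch.no_grad():  # teacher is frozen
#       teacher_out = teacher(**batch, output_hidden_states=True)
#   student_out = student(**batch, output_hidden_states=True)
#   loss = distillation_loss(student_out, teacher_out)
```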

Resource Usage

Peak GPU Memory: 12.6346 GB
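A figure like this is commonly obtained from PyTorch's CUDA memory counters; whether Distily reports exactly this statistic is an assumption:

```python
# Sketch: measure peak GPU memory with PyTorch's CUDA counters.
# Requires a CUDA device.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training / evaluation workload here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```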

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
| 0 | 0 | 181673.8281 | 182246.2969 | 0.4342 | 81.4623 | 12.276 | 3.069 | 181831.9062 |
| 500 | 0.0808 | 51828.1172 | 61691.7969 | 0.0957 | 81.6929 | 12.241 | 3.06 | 207997.2812 |
| 1000 | 0.1616 | 45041.4180 | 54485.3594 | 0.0860 | 81.5919 | 12.256 | 3.064 | 209966.75 |
| 1500 | 0.2424 | 44320.2031 | 52669.4492 | 0.0833 | 81.6149 | 12.253 | 3.063 | 212294.125 |
| 2000 | 0.3232 | 41822.9727 | 48903.5156 | 0.0852 | 81.4039 | 12.284 | 3.071 | 223025.1094 |
| 2500 | 0.4040 | 41753.0312 | 49089.7227 | 0.0827 | 81.431 | 12.28 | 3.07 | 220010.75 |
| 3000 | 0.4848 | 42048.8906 | 49907.2539 | 0.0840 | 81.49 | 12.271 | 3.068 | 210042.8594 |
| 3500 | 0.5657 | 41577.9180 | 46600.7773 | 0.0849 | 81.4631 | 12.275 | 3.069 | 221650.6875 |
| 4000 | 0.6465 | 41242.4766 | 46538.6875 | 0.0824 | 81.561 | 12.261 | 3.065 | 211307.8125 |
| 4500 | 0.7273 | 41470.4414 | 46413.6094 | 0.0803 | 81.5887 | 12.257 | 3.064 | 221236.5469 |
| 5000 | 0.8081 | 42138.4922 | 45917.2109 | 0.0799 | 81.4339 | 12.28 | 3.07 | 207382.2812 |
| 5500 | 0.8889 | 41578.75 | 45413.4766 | 0.0818 | 81.4181 | 12.282 | 3.071 | 233708.25 |
| 6000 | 0.9697 | 42023.4766 | 45574.0469 | 0.0790 | 81.4797 | 12.273 | 3.068 | 208369.9375 |
| 6187 | 0.9999 | 42178.9805 | 45721.7109 | 0.0786 | 81.5551 | 12.262 | 3.065 | 198318.0 |

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • Pytorch 2.3.0
  • Datasets 2.20.0