metadata

base_model: Qwen/Qwen2-0.5B-Instruct
library_name: distily
license: apache-2.0
tags:
  - generated_from_trainer
model-index:
  - name: distily_experiments_loss_cakld
    results: []

distily_experiments_loss_cakld

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct. The training dataset is not specified.

The Distily library was used for this distillation.

At the final training step (6187), it achieves the following results on the evaluation set:

eval_enwikippl: 12004.6143
eval_frwikippl: 41571.4961
eval_zhwikippl: 304727.5625
eval_loss: 5.0626
eval_runtime: 154.2547
eval_samples_per_second: 6.483
eval_steps_per_second: 3.241
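
The throughput figures above can be cross-checked against each other. The following is a minimal sketch, using only the numbers reported in this card, that multiplies runtime by throughput to estimate how many evaluation samples and steps were processed. The variable names are illustrative, and the card does not state the evaluation set size directly, so the results are estimates.

```python
# Values copied from the evaluation results above.
eval_runtime = 154.2547            # seconds
eval_samples_per_second = 6.483
eval_steps_per_second = 3.241

# Throughput multiplied by runtime approximates the amount of work done.
approx_samples = eval_runtime * eval_samples_per_second
approx_steps = eval_runtime * eval_steps_per_second

# Samples per step should be close to eval_batch_size if the figures agree.
implied_batch_size = approx_samples / approx_steps

print(f"evaluation samples ~ {approx_samples:.0f}")
print(f"evaluation steps   ~ {approx_steps:.0f}")
print(f"implied batch size ~ {implied_batch_size:.2f}")
```

The implied batch size should land close to the eval_batch_size of 2 listed in the training hyperparameters.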

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_strategy: logits_activations
loss_fn: cakld
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 2
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
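
The relationship between the batch-size settings above can be sketched as follows. This is an illustration rather than Distily's own code. It assumes a single training device (not stated in the card, but consistent with the reported total of 16) and takes the final step and epoch values from the Eval-Phase Metrics table to estimate how many training samples one epoch covers.

```python
# Hyperparameters copied from the list above.
train_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # assumption: the card does not state the device count

# Samples consumed per optimizer update.
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)

# Final row of the Eval-Phase Metrics table.
final_step = 6187
final_epoch = 0.9999

# Estimated optimizer steps and training samples in one full epoch.
steps_per_epoch = final_step / final_epoch
approx_train_samples = steps_per_epoch * total_train_batch_size

print(f"total_train_batch_size = {total_train_batch_size}")
print(f"steps per epoch ~ {steps_per_epoch:.0f}")
print(f"training samples per epoch ~ {approx_train_samples:.0f}")
```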

Resource Usage

Peak GPU Memory: 16.3131 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	zhwikippl
teacher eval		13.0728	11.6511					21.6338
0	0	180421.8594	182559.25	426.3911	153.7612	6.504	3.252	181425.2031
500	0.0808	25914.9355	58400.8789	9.3101	153.5119	6.514	3.257	246441.4688
1000	0.1616	18034.6934	51549.3594	7.7199	153.5062	6.514	3.257	268579.2188
1500	0.2424	16219.1631	49107.0469	6.9828	153.332	6.522	3.261	295905.5312
2000	0.3232	14691.3838	42907.7266	6.4286	153.9179	6.497	3.248	327841.9375
2500	0.4040	16166.2109	47578.2539	6.0867	153.7651	6.503	3.252	314603.125
3000	0.4848	15222.2021	45014.1914	5.7882	153.7316	6.505	3.252	333285.875
3500	0.5657	14355.2061	44631.9570	5.5960	154.0532	6.491	3.246	350936.6875
4000	0.6465	13183.8857	43155.7617	5.4694	154.1821	6.486	3.243	353883.125
4500	0.7273	12622.3330	41256.7539	5.5192	154.1834	6.486	3.243	351506.7812
5000	0.8081	12319.9580	40217.3828	5.4403	154.6126	6.468	3.234	345530.5
5500	0.8889	12254.1387	42295.5117	5.1549	154.6614	6.466	3.233	350950.0625
6000	0.9697	11398.9785	39517.2930	5.0769	154.199	6.485	3.243	359435.0625
6187	0.9999	12004.6143	41571.4961	5.0626	154.2547	6.483	3.241	304727.5625
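
To put the table in perspective, the sketch below compares the student's final perplexities with the teacher's and with the student's own step-0 values. The numbers are copied from the table; the helper name `ppl_ratio` is hypothetical and not part of Distily.

```python
# Rows copied from the Eval-Phase Metrics table.
teacher = {"enwikippl": 13.0728, "frwikippl": 11.6511, "zhwikippl": 21.6338}
student_step0 = {
    "enwikippl": 180421.8594,
    "frwikippl": 182559.25,
    "zhwikippl": 181425.2031,
}
student_final = {
    "enwikippl": 12004.6143,
    "frwikippl": 41571.4961,
    "zhwikippl": 304727.5625,
}


def ppl_ratio(numerator, denominator):
    """Return the per-metric ratio numerator / denominator."""
    return {key: numerator[key] / denominator[key] for key in denominator}


# How many times higher the final student perplexity is than the teacher's.
gap_to_teacher = ppl_ratio(student_final, teacher)

# Values below 1 mean perplexity fell during training; above 1 means it rose.
change_since_start = ppl_ratio(student_final, student_step0)

for key in teacher:
    print(
        f"{key}: {gap_to_teacher[key]:.0f}x teacher, "
        f"{change_since_start[key]:.3f}x step 0"
    )
```

On these figures, English and French perplexity fall over the run while Chinese perplexity ends higher than at step 0, and all three remain far above the teacher's values.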

Framework versions

Distily 0.1.0
Transformers 4.43.3
Pytorch 2.3.0
Datasets 2.20.0
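
When reproducing the run, it may help to check the local environment against the versions listed above. The following is a minimal sketch using the standard-library `importlib.metadata` module. The distribution names, in particular `distily`, are assumptions about how the packages are published, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions listed in this card; distribution names are assumed.
EXPECTED = {
    "distily": "0.1.0",
    "transformers": "4.43.3",
    "torch": "2.3.0",
    "datasets": "2.20.0",
}


def check_versions(expected):
    """Map each package name to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Accept local build tags such as "2.3.0+cu121" as a match.
        matches = installed is not None and (
            installed == wanted or installed.startswith(wanted + "+")
        )
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_versions(EXPECTED).items():
    status = "ok" if ok else "differs or missing"
    print(f"{name}: installed={installed}, expected={EXPECTED[name]} ({status})")
```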