
distily_experiments_loss_cakld

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct; the training dataset is unspecified.

The Distily library was used for this distillation.
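Since the student is a standard causal language model checkpoint, it can be loaded with Hugging Face Transformers. The snippet below is a minimal usage sketch, assuming the checkpoint at lapp0/distily_experiments_loss_cakld is compatible with AutoModelForCausalLM; the prompt and generation settings are illustrative only.

```python
# Minimal usage sketch (assumption: standard causal-LM checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_cakld"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation compresses a teacher model by", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```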

It achieves the following results on the evaluation set:

  • eval_enwikippl: 12004.6143
  • eval_frwikippl: 41571.4961
  • eval_zhwikippl: 304727.5625
  • eval_loss: 5.0626
  • eval_runtime: 154.2547
  • eval_samples_per_second: 6.483
  • eval_steps_per_second: 3.241
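The eval_enwikippl, eval_frwikippl, and eval_zhwikippl values are per-dataset perplexities (the names indicate English, French, and Chinese Wikipedia evaluation text), and eval_runtime follows the Transformers Trainer convention of seconds. For reference, the sketch below shows the conventional way such a perplexity is computed from token-level cross-entropy; it illustrates the standard definition, not necessarily Distily's exact evaluation loop.

```python
# Sketch: perplexity as exp(mean token-level cross-entropy) over an evaluation text.
# Illustrates the standard definition; the Distily evaluation code may differ in detail.
import math
import torch

def perplexity(model, tokenizer, text, max_length=1024):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```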

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: cakld
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
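The total_train_batch_size of 16 is simply the per-device train_batch_size (2) multiplied by gradient_accumulation_steps (8). The loss_fn: cakld entry names the distillation objective; one common formulation of a confidence-adaptive KL-divergence loss blends forward and reverse KL between the teacher and student token distributions, as in the hedged sketch below. This illustrates that general technique under assumed semantics for the blending coefficient, not a verified copy of Distily's cakld implementation.

```python
# Hedged sketch of a confidence-adaptive KL distillation loss on logits.
# `gamma` (e.g. the teacher's average top-token confidence) blends the two KL directions;
# this shows the general idea, not necessarily Distily's exact `cakld` code.
import torch
import torch.nn.functional as F

def cakld_like_loss(student_logits, teacher_logits, gamma):
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    forward_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean()  # KL(teacher || student)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()  # KL(student || teacher)
    return gamma * forward_kl + (1.0 - gamma) * reverse_kl
```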

Resource Usage

Peak GPU Memory: 16.3131 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0728 | 11.6511 | | | | | 21.6338 |
| 0 | 0 | 180421.8594 | 182559.25 | 426.3911 | 153.7612 | 6.504 | 3.252 | 181425.2031 |
| 500 | 0.0808 | 25914.9355 | 58400.8789 | 9.3101 | 153.5119 | 6.514 | 3.257 | 246441.4688 |
| 1000 | 0.1616 | 18034.6934 | 51549.3594 | 7.7199 | 153.5062 | 6.514 | 3.257 | 268579.2188 |
| 1500 | 0.2424 | 16219.1631 | 49107.0469 | 6.9828 | 153.332 | 6.522 | 3.261 | 295905.5312 |
| 2000 | 0.3232 | 14691.3838 | 42907.7266 | 6.4286 | 153.9179 | 6.497 | 3.248 | 327841.9375 |
| 2500 | 0.4040 | 16166.2109 | 47578.2539 | 6.0867 | 153.7651 | 6.503 | 3.252 | 314603.125 |
| 3000 | 0.4848 | 15222.2021 | 45014.1914 | 5.7882 | 153.7316 | 6.505 | 3.252 | 333285.875 |
| 3500 | 0.5657 | 14355.2061 | 44631.9570 | 5.5960 | 154.0532 | 6.491 | 3.246 | 350936.6875 |
| 4000 | 0.6465 | 13183.8857 | 43155.7617 | 5.4694 | 154.1821 | 6.486 | 3.243 | 353883.125 |
| 4500 | 0.7273 | 12622.3330 | 41256.7539 | 5.5192 | 154.1834 | 6.486 | 3.243 | 351506.7812 |
| 5000 | 0.8081 | 12319.9580 | 40217.3828 | 5.4403 | 154.6126 | 6.468 | 3.234 | 345530.5 |
| 5500 | 0.8889 | 12254.1387 | 42295.5117 | 5.1549 | 154.6614 | 6.466 | 3.233 | 350950.0625 |
| 6000 | 0.9697 | 11398.9785 | 39517.2930 | 5.0769 | 154.199 | 6.485 | 3.243 | 359435.0625 |
| 6187 | 0.9999 | 12004.6143 | 41571.4961 | 5.0626 | 154.2547 | 6.483 | 3.241 | 304727.5625 |

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • Pytorch 2.3.0
  • Datasets 2.20.0