zephyr-7b-dpo-lora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 1.5622
  • Rewards/chosen: -17.0278
  • Rewards/rejected: -19.8457
  • Rewards/accuracies: 0.6500
  • Rewards/margins: 2.8179
  • Logps/rejected: -2233.0220
  • Logps/chosen: -1971.0188
  • Logits/rejected: -1.7584
  • Logits/chosen: -1.7819

Model description

This model is a LoRA adapter for alignment-handbook/zephyr-7b-sft-full, trained with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized preference dataset. A usage sketch follows.
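A minimal inference sketch, assuming the adapter applies on top of the SFT base model named above and that the base tokenizer defines a chat template; the card does not publish inference code, so treat this as illustrative:

```python
# A minimal sketch, not the author's published inference code.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model with the LoRA adapter applied on top.
model = AutoPeftModelForCausalLM.from_pretrained(
    "LaoRay/zephyr-7b-dpo-lora",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Tokenizer comes from the SFT base model (assumption: it ships a chat template).
tokenizer = AutoTokenizer.from_pretrained("alignment-handbook/zephyr-7b-sft-full")

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```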

Intended uses & limitations

More information needed

Training and evaluation data

Training and evaluation both use HuggingFaceH4/ultrafeedback_binarized, a binarized version of the UltraFeedback dataset in which each prompt is paired with one chosen and one rejected response; see the loading sketch below.
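A short sketch for inspecting the dataset. The preference split names (train_prefs/test_prefs) come from the dataset's own card, not from this model card, so verify them against the Hub:

```python
from datasets import load_dataset

# Preference splits hold prompt/chosen/rejected triples used for DPO.
train_prefs = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
test_prefs = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")

example = train_prefs[0]
print(example.keys())  # e.g. prompt, chosen, rejected, ...
```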

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch in code follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 20
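As referenced above, a sketch of how these values map onto a Hugging Face TrainingArguments object. The DPO beta, the LoRA rank, and the exact trainer (plausibly trl's DPOTrainer, given the Rewards/* metrics) are not recorded on this card, so only the listed values appear below; output_dir and bf16 are placeholders:

```python
# A configuration sketch, assuming a standard transformers training loop
# (e.g. trl's DPOTrainer, which builds on transformers.Trainer).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo-lora",  # placeholder, not from the card
    learning_rate=5e-6,
    per_device_train_batch_size=4,    # train_batch_size
    per_device_eval_batch_size=8,     # eval_batch_size
    seed=42,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=20,
    adam_beta1=0.9,                   # transformers defaults, matching
    adam_beta2=0.999,                 # "Adam with betas=(0.9,0.999)
    adam_epsilon=1e-8,                #  and epsilon=1e-08" above
    bf16=True,                        # assumption: typical for a 7B DPO run
)
# Effective batch: 4 per device x 4 accumulation steps = 16,
# the total_train_batch_size logged above.
```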

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6861        | 0.992 | 62   | 0.6889          | 0.0006         | -0.0081          | 0.6450             | 0.0087          | -249.2549      | -268.1721    | -2.8525         | -2.8868       |
| 0.6612        | 2.0   | 125  | 0.6531          | -0.0006        | -0.0929          | 0.6550             | 0.0923          | -257.7365      | -268.2944    | -2.8301         | -2.8604       |
| 0.498         | 2.992 | 187  | 0.6181          | -0.5345        | -0.7907          | 0.6950             | 0.2561          | -327.5125      | -321.6882    | -2.7597         | -2.7831       |
| 0.2445        | 4.0   | 250  | 0.6131          | -1.7835        | -2.3662          | 0.6800             | 0.5827          | -485.0650      | -446.5857    | -2.0948         | -2.1082       |
| 0.1749        | 4.992 | 312  | 0.6447          | -2.2836        | -2.9706          | 0.6650             | 0.6870          | -545.5024      | -496.5903    | -1.8275         | -1.8365       |
| 0.0492        | 6.0   | 375  | 0.9374          | -7.0520        | -8.3385          | 0.6400             | 1.2865          | -1082.3002     | -973.4358    | -1.3453         | -1.3454       |
| 0.0064        | 6.992 | 437  | 0.8928          | -8.6496        | -10.2275         | 0.6450             | 1.5779          | -1271.1948     | -1133.1906   | -1.5853         | -1.5996       |
| 0.01          | 8.0   | 500  | 1.2673          | -13.8405       | -16.0886         | 0.6400             | 2.2482          | -1857.3101     | -1652.2802   | -1.6448         | -1.6610       |
| 0.0007        | 8.992 | 562  | 1.1752          | -11.4716       | -13.4777         | 0.6300             | 2.0061          | -1596.2178     | -1415.3928   | -1.8498         | -1.8705       |
| 0.0002        | 10.0  | 625  | 1.3088          | -13.5264       | -15.8880         | 0.6350             | 2.3616          | -1837.2434     | -1620.8707   | -1.8164         | -1.8397       |
| 0.0003        | 10.992| 687  | 1.3563          | -15.6686       | -18.2912         | 0.6700             | 2.6225          | -2077.5627     | -1835.0981   | -1.7419         | -1.7643       |
| 0.0001        | 12.0  | 750  | 1.4799          | -16.0123       | -18.6412         | 0.6400             | 2.6289          | -2112.5684     | -1869.4608   | -1.7532         | -1.7747       |
| 0.0           | 12.992| 812  | 1.4863          | -15.9107       | -18.5614         | 0.6450             | 2.6507          | -2104.5852     | -1859.3058   | -1.7792         | -1.8020       |
| 0.0003        | 14.0  | 875  | 1.5278          | -16.6140       | -19.3716         | 0.6500             | 2.7576          | -2185.6045     | -1929.6328   | -1.7600         | -1.7826       |
| 0.0438        | 14.992| 937  | 1.5387          | -16.7605       | -19.5376         | 0.6500             | 2.7771          | -2202.2078     | -1944.2887   | -1.7625         | -1.7854       |
| 0.0001        | 16.0  | 1000 | 1.5438          | -16.8482       | -19.6450         | 0.6550             | 2.7968          | -2212.9512     | -1953.0580   | -1.7596         | -1.7831       |
| 0.0435        | 16.992| 1062 | 1.5527          | -16.9283       | -19.7428         | 0.6500             | 2.8145          | -2222.7285     | -1961.0630   | -1.7629         | -1.7860       |
| 0.0001        | 18.0  | 1125 | 1.5617          | -16.9933       | -19.8065         | 0.6550             | 2.8133          | -2229.1018     | -1967.5621   | -1.7580         | -1.7814       |
| 0.0002        | 18.992| 1187 | 1.5675          | -17.0212       | -19.8377         | 0.6550             | 2.8165          | -2232.2144     | -1970.3562   | -1.7594         | -1.7825       |
| 0.0001        | 19.84 | 1240 | 1.5622          | -17.0278       | -19.8457         | 0.6500             | 2.8179          | -2233.0220     | -1971.0188   | -1.7584         | -1.7819       |
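The reward columns follow the usual DPO logging convention (an assumption, since the training script is not published): each reward is beta times the policy-versus-reference log-probability gap, and rewards/margins is simply rewards/chosen minus rewards/rejected, as the final row confirms:

```python
# Sanity check on the final row (step 1240).
chosen, rejected = -17.0278, -19.8457
print(round(chosen - rejected, 4))  # 2.8179 == rewards/margins
```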

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.0
  • Pytorch 2.4.0+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1