metadata

license: apache-2.0
base_model: TheBloke/OpenHermes-2-Mistral-7B-GPTQ
tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: mistral-dpo
    results: []

mistral-dpo

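This model is a fine-tuned version of TheBloke/OpenHermes-2-Mistral-7B-GPTQ, trained with DPO (per the trl and dpo tags). The training dataset was not recorded; the auto-generated card lists it as "None". At the final training step (step 50), it achieves the following results on the evaluation set: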

Loss: 0.0012
Rewards/chosen: 23.3318
Rewards/rejected: -6.7489
Rewards/accuracies: 1.0
Rewards/margins: 30.0806
Logps/rejected: -87.3513
Logps/chosen: -337.4500
Logits/rejected: -1.2781
Logits/chosen: -1.6769
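
For readers unfamiliar with these metric names, the sketch below shows how reward-style DPO metrics are commonly derived from sequence log-probabilities. It is an illustration under stated assumptions, not this model's training code: it assumes the standard sigmoid DPO loss and a hypothetical beta of 0.1 (the card states neither), and the log-probabilities are made-up toy values.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch-averaged DPO metrics from summed sequence log-probs.

    beta=0.1 is an assumed value for illustration; the card does not report it.
    """
    # Implicit reward: beta * (policy log-prob - reference log-prob)
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen, rejected)]
    # Sigmoid DPO loss per example: -log(sigmoid(m)), written in a stable form
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # Fraction of pairs where the chosen reward beats the rejected one
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
        "rewards/margins": sum(margins) / n,
    }

# Toy log-probs (hypothetical, two preference pairs)
metrics = dpo_metrics(
    policy_chosen=[-10.0, -20.0],
    policy_rejected=[-15.0, -18.0],
    ref_chosen=[-12.0, -19.0],
    ref_rejected=[-13.0, -18.5],
)
```

Because the loss is averaged per example, the reported loss generally differs from the loss evaluated at the average margin.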

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2
training_steps: 50
mixed_precision_training: Native AMP
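
To make the scheduler settings concrete, here is a minimal sketch of a linear warmup and linear decay multiplier using the values listed above (learning rate 0.0002, 2 warmup steps, 50 training steps). The function name is hypothetical, and the exact learning rate applied at a given optimizer step depends on how the trainer counts steps, so treat the numbers as approximate.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, total_steps=50):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

# Learning rate at the steps logged in the training results table
schedule = {step: linear_schedule_lr(step) for step in (0, 1, 2, 10, 20, 30, 40, 50)}
```

With only 2 warmup steps out of 50, nearly the whole run is spent in the decay phase.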

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5937	0.01	10	0.3601	0.7833	-0.3607	0.9904	1.1441	-23.4698	-562.9344	-1.1371	-1.4401
0.0908	0.02	20	1.1245	8.8420	-2.7352	0.9615	11.5772	-47.2141	-482.3473	-1.1942	-1.5504
0.0683	0.03	30	0.2541	17.6490	-4.7403	0.9904	22.3893	-67.2654	-394.2778	-1.2341	-1.6426
0.0009	0.04	40	0.0015	22.5664	-5.9863	1.0	28.5527	-79.7251	-345.1035	-1.2763	-1.6781
0.0003	0.05	50	0.0012	23.3318	-6.7489	1.0	30.0806	-87.3513	-337.4500	-1.2781	-1.6769
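
As a quick sanity check on the table, Rewards/margins should equal Rewards/chosen minus Rewards/rejected, up to rounding in the last printed digit. The sketch below copies the relevant columns from the rows above and computes the gap for each step; the variable and function names are illustrative only.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins) copied from the table
rows = [
    (10, 0.7833, -0.3607, 1.1441),
    (20, 8.8420, -2.7352, 11.5772),
    (30, 17.6490, -4.7403, 22.3893),
    (40, 22.5664, -5.9863, 28.5527),
    (50, 23.3318, -6.7489, 30.0806),
]

def margin_gaps(table_rows):
    """Absolute difference between the reported margin and chosen - rejected."""
    return {
        step: abs((chosen - rejected) - margin)
        for step, chosen, rejected, margin in table_rows
    }

gaps = margin_gaps(rows)
# Each gap should be no larger than the rounding error of four-decimal values
consistent = all(gap < 2e-4 for gap in gaps.values())
```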

Framework versions

Transformers 4.35.2
PyTorch 2.0.1+cu117
Datasets 2.15.0
Tokenizers 0.15.0