dpo

This model is a fine-tuned version of unsloth/llama-3-8b-Instruct-bnb-4bit on an unspecified dataset (the dataset name was not recorded). It achieves the following results on the evaluation set:

Loss: 0.6257
Rewards/chosen: 0.8141
Rewards/rejected: 0.4945
Rewards/accuracies: 0.6431
Rewards/margins: 0.3196
Logps/rejected: -229.7856
Logps/chosen: -249.2073
Logits/rejected: -0.6789
Logits/chosen: -0.6135
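
The reward figures above are related by simple arithmetic: the margin is the chosen reward minus the rejected reward. The sketch below checks this against the reported values and shows how the loss for a single preference pair would follow from its margin. It assumes the standard sigmoid DPO loss, in which the reward values already include the scaling factor beta; the card does not state which loss variant or beta was used, and `dpo_pair_loss` is a hypothetical helper name.

```python
import math

# Evaluation metrics reported in this card.
rewards_chosen = 0.8141
rewards_rejected = 0.4945
reported_margin = 0.3196

# The margin is the mean chosen reward minus the mean rejected reward.
margin = rewards_chosen - rewards_rejected
print(f"margin = {margin:.4f}")


def dpo_pair_loss(reward_margin: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(reward_margin)).

    Assumption: the standard sigmoid DPO loss; the card does not say
    which variant was used. Hypothetical helper, not a library function.
    """
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if reward_margin >= 0:
        return math.log1p(math.exp(-reward_margin))
    return -reward_margin + math.log1p(math.exp(reward_margin))


# A margin of zero (no preference learned) gives log(2), about 0.6931.
print(f"loss at zero margin = {dpo_pair_loss(0.0):.4f}")
# Loss of a single pair whose margin equals the reported mean margin.
print(f"loss at mean margin = {dpo_pair_loss(margin):.4f}")
```

The reported evaluation loss is an average over many pairs. Because this loss is convex in the margin, the average loss can be higher than the loss evaluated at the average margin, so the two figures are not expected to match.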

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 4
eval_batch_size: 4
seed: 0
gradient_accumulation_steps: 8
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
training_steps: 750
mixed_precision_training: Native AMP
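
Two of the values above are linked to the others rather than chosen independently. The sketch below reproduces total_train_batch_size from the batch size and gradient accumulation steps, and evaluates a cosine schedule with warmup at a few steps. It assumes a single training device (consistent with 4 x 8 = 32) and the common schedule shape of linear warmup followed by a half-cosine decay to zero; `lr_at_step` is a hypothetical helper, not a library function.

```python
import math

# Hyperparameters copied from the list above.
learning_rate = 5e-05
train_batch_size = 4
gradient_accumulation_steps = 8
warmup_steps = 100
training_steps = 750
num_devices = 1  # assumption: the card does not state the device count

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 32


def lr_at_step(step: int) -> float:
    """Learning rate under linear warmup followed by half-cosine decay.

    Assumption: decay reaches zero at the final training step.
    """
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (training_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


# Start of warmup, mid-warmup, peak, midpoint of decay, final step.
for step in (0, 50, 100, 425, 750):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the rate peaks at 5e-05 at step 100 and falls to half of that at step 425, the midpoint of the decay phase.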

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6904	0.0372	28	0.6811	0.2766	0.2476	0.5770	0.0290	-232.2545	-254.5816	-0.5471	-0.5010
0.6591	0.0745	56	0.6623	0.9939	0.8694	0.5927	0.1245	-226.0365	-247.4085	-0.5351	-0.4798
0.6297	0.1117	84	0.6542	1.1966	0.9862	0.6136	0.2104	-224.8689	-245.3818	-0.4689	-0.4120
0.5985	0.1489	112	0.6540	1.5211	1.2525	0.6087	0.2687	-222.2059	-242.1367	-0.4989	-0.4262
0.6603	0.1862	140	0.6459	0.7737	0.5130	0.6304	0.2607	-229.6009	-249.6110	-0.5779	-0.5054
0.619	0.2234	168	0.6411	0.9352	0.6917	0.6222	0.2435	-227.8137	-247.9963	-0.5842	-0.5261
0.6497	0.2606	196	0.6427	0.8696	0.6404	0.6282	0.2292	-228.3268	-248.6518	-0.5798	-0.5255
0.6014	0.2979	224	0.6397	0.8941	0.6357	0.6263	0.2583	-228.3730	-248.4069	-0.6397	-0.5816
0.594	0.3351	252	0.6361	0.7069	0.4027	0.6319	0.3043	-230.7038	-250.2785	-0.6434	-0.5848
0.5898	0.3723	280	0.6356	1.0373	0.7462	0.6278	0.2911	-227.2686	-246.9745	-0.6340	-0.5714
0.639	0.4096	308	0.6342	0.7199	0.4321	0.6342	0.2878	-230.4095	-250.1490	-0.6956	-0.6293
0.6289	0.4468	336	0.6363	0.4299	0.1879	0.6248	0.2420	-232.8515	-253.0488	-0.6705	-0.6155
0.6304	0.4840	364	0.6321	0.7719	0.5053	0.6435	0.2667	-229.6779	-249.6284	-0.6279	-0.5652
0.6126	0.5213	392	0.6325	0.5194	0.2033	0.6375	0.3161	-232.6973	-252.1539	-0.6785	-0.6117
0.5974	0.5585	420	0.6254	0.7418	0.4269	0.6428	0.3149	-230.4618	-249.9303	-0.6823	-0.6170
0.6185	0.5957	448	0.6267	0.9534	0.6106	0.6409	0.3428	-228.6247	-247.8141	-0.6532	-0.5866
0.604	0.6330	476	0.6284	0.8011	0.4691	0.6394	0.3320	-230.0398	-249.3374	-0.6842	-0.6177
0.6154	0.6702	504	0.6269	0.8353	0.5307	0.6431	0.3046	-229.4234	-248.9947	-0.6705	-0.6051
0.5936	0.7074	532	0.6277	0.7287	0.4206	0.6469	0.3082	-230.5248	-250.0604	-0.6887	-0.6226
0.6291	0.7447	560	0.6260	0.8539	0.5327	0.6439	0.3211	-229.4030	-248.8091	-0.6758	-0.6096
0.6169	0.7819	588	0.6255	0.8797	0.5669	0.6461	0.3127	-229.0613	-248.5513	-0.6690	-0.6041
0.5934	0.8191	616	0.6256	0.8582	0.5399	0.6461	0.3183	-229.3312	-248.7658	-0.6753	-0.6095
0.6004	0.8564	644	0.6257	0.8263	0.5074	0.6450	0.3189	-229.6564	-249.0845	-0.6761	-0.6110
0.6282	0.8936	672	0.6256	0.8133	0.4949	0.6442	0.3184	-229.7819	-249.2152	-0.6748	-0.6101
0.5572	0.9309	700	0.6258	0.8122	0.4938	0.6442	0.3184	-229.7925	-249.2255	-0.6781	-0.6129
0.595	0.9681	728	0.6256	0.8140	0.4943	0.6428	0.3197	-229.7873	-249.2078	-0.6788	-0.6134
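
The Epoch and Step columns allow a rough estimate of how many optimizer steps make up one epoch, and from that the approximate size of the training set. The sketch below uses the first and last rows of the table and assumes an effective batch size of 32 on a single device; the resulting example count is an estimate, not a figure stated in the card.

```python
# (epoch, step) pairs copied from the first and last rows of the table.
rows = [(0.0372, 28), (0.9681, 728)]
total_train_batch_size = 32  # assumption: single device, 4 x 8
training_steps = 750

# Each row gives an independent estimate of steps per epoch; average them.
estimates = [step / epoch for epoch, step in rows]
steps_per_epoch = sum(estimates) / len(estimates)

print(f"steps per epoch ~ {steps_per_epoch:.0f}")
print(f"epochs covered by {training_steps} steps ~ {training_steps / steps_per_epoch:.3f}")
print(f"training examples ~ {steps_per_epoch * total_train_batch_size:.0f}")
```

On these numbers, the 750 training steps cover slightly less than one pass over the training data.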

Framework versions

PEFT 0.11.1
Transformers 4.41.2
Pytorch 2.3.0+cu121
Datasets 2.19.2
Tokenizers 0.19.1
