llama-7b-SFT-qlora-eli5_DPO_ds_RM_contrast_1024_r_64_alpha_16

This model is a fine-tuned version of dhmeltzer/llama-7b-SFT_ds_eli5_1024_r_64_alpha_16_merged on an unknown dataset. Its results on the evaluation set at each logged step are given in the table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed in this card.

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6908 | 0.1 | 19 | 0.6536 | -0.2975 | -0.4466 | 0.6060 | 0.1491 | -206.9322 | -213.6386 | 1.1767 | 1.1980 |
| 0.6613 | 0.21 | 38 | 0.6391 | -0.1759 | -0.3858 | 0.6172 | 0.2099 | -206.3239 | -212.4229 | 1.1695 | 1.1930 |
| 0.6667 | 0.31 | 57 | 0.6297 | -0.0287 | -0.2656 | 0.6440 | 0.2369 | -205.1224 | -210.9511 | 1.1612 | 1.1863 |
| 0.6532 | 0.42 | 76 | 0.6271 | -0.0915 | -0.3376 | 0.6172 | 0.2461 | -205.8420 | -211.5791 | 1.1395 | 1.1612 |
| 0.6546 | 0.52 | 95 | 0.6235 | -0.0575 | -0.2906 | 0.6362 | 0.2331 | -205.3723 | -211.2390 | 1.1551 | 1.1781 |
| 0.6528 | 0.62 | 114 | 0.6231 | -0.0939 | -0.3382 | 0.6562 | 0.2443 | -205.8482 | -211.6033 | 1.1702 | 1.1932 |
| 0.646 | 0.73 | 133 | 0.6204 | -0.1525 | -0.4204 | 0.6518 | 0.2678 | -206.6696 | -212.1891 | 1.1664 | 1.1886 |
| 0.6524 | 0.83 | 152 | 0.6208 | -0.1083 | -0.3660 | 0.6607 | 0.2577 | -206.1257 | -211.7465 | 1.1548 | 1.1765 |
| 0.6335 | 0.94 | 171 | 0.6204 | -0.0937 | -0.3490 | 0.6641 | 0.2553 | -205.9560 | -211.6011 | 1.1663 | 1.1890 |
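
The reward columns in the table above appear to follow the usual DPO logging convention, in which Rewards/margins is the mean chosen reward minus the mean rejected reward. That convention is an assumption here, since the card does not state it. The sketch below copies the three reward columns from the table and checks the relation row by row. The helper names `margin_gaps` and `largest_margin_step` are hypothetical and are not part of any library.

```python
# Reward columns copied from the table: (step, chosen, rejected, reported margin).
ROWS = [
    (19, -0.2975, -0.4466, 0.1491),
    (38, -0.1759, -0.3858, 0.2099),
    (57, -0.0287, -0.2656, 0.2369),
    (76, -0.0915, -0.3376, 0.2461),
    (95, -0.0575, -0.2906, 0.2331),
    (114, -0.0939, -0.3382, 0.2443),
    (133, -0.1525, -0.4204, 0.2678),
    (152, -0.1083, -0.3660, 0.2577),
    (171, -0.0937, -0.3490, 0.2553),
]


def margin_gaps(rows):
    """Return |(chosen - rejected) - reported margin| for every row."""
    return [
        abs((chosen - rejected) - margin)
        for _step, chosen, rejected, margin in rows
    ]


def largest_margin_step(rows):
    """Return the training step with the largest reported reward margin."""
    return max(rows, key=lambda row: row[3])[0]


gaps = margin_gaps(ROWS)
# Assumed convention: margin = chosen - rejected, up to 4-decimal rounding.
print("largest gap:", max(gaps))
print("step with largest margin:", largest_margin_step(ROWS))
```

Under this assumption the reported margins match the difference of the two reward columns to within the rounding of the four-decimal values. The only row that differs in the last digit is step 133.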