OpenELM-1_1B-DPO-full-max-random-reward

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

Loss: 194.2460
Rewards/chosen: -660.0
Rewards/rejected: -568.0
Rewards/accuracies: 0.4277
Rewards/margins: -89.5
Logps/rejected: -57344.0
Logps/chosen: -66560.0
Logits/rejected: 7.5
Logits/chosen: 7.0

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 16
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
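
The two total batch sizes follow from the per-device values, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and, as an illustration, a cosine schedule with linear warmup. The total of 2865 optimizer steps is an assumption inferred from the logged epoch fractions (the card does not state it), and lr_at_step is a hypothetical helper rather than the trainer's own scheduler.

```python
import math

# Values copied from the hyperparameter list.
learning_rate = 5e-05
train_batch_size = 8      # per device
eval_batch_size = 16      # per device
num_devices = 4
gradient_accumulation_steps = 2
warmup_ratio = 0.1

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no accumulation at eval time
print(total_train_batch_size, total_eval_batch_size)


def lr_at_step(step, total_steps, peak_lr=learning_rate, ratio=warmup_ratio):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# ASSUMPTION: about 2865 optimizer steps in total (3 epochs of roughly 955 steps).
assumed_total_steps = 2865
for step in (0, 287, 1576, 2865):
    print(step, lr_at_step(step, assumed_total_steps))
```

With these assumed values the learning rate rises from zero to the 5e-05 peak over roughly the first tenth of training and then decays to zero by the final step.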

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6914	0.1047	100	0.6983	-0.3262	-0.3223	0.4004	-0.0052	-320.0	-352.0	-9.4375	-9.8125
0.6914	0.2094	200	58.1418	-201.0	-173.0	0.4453	-28.0	-17664.0	-20480.0	0.2969	0.2178
0.6914	0.3141	300	97.2609	-330.0	-284.0	0.4258	-44.75	-28800.0	-33280.0	-0.3262	-0.3027
0.6914	0.4188	400	102.8539	-348.0	-300.0	0.4297	-47.75	-30464.0	-35328.0	-0.7656	-0.7461
0.6914	0.5236	500	108.8187	-368.0	-318.0	0.4277	-50.25	-32128.0	-37120.0	-0.2490	-0.2812
0.6914	0.6283	600	114.7604	-388.0	-336.0	0.4355	-53.0	-33792.0	-39168.0	1.25	1.1016
0.6914	0.7330	700	120.9475	-410.0	-354.0	0.4277	-55.75	-35584.0	-41216.0	2.3906	2.1875
0.6914	0.8377	800	127.3012	-432.0	-372.0	0.4336	-58.75	-37632.0	-43520.0	4.2812	3.9062
0.6914	0.9424	900	133.8314	-454.0	-392.0	0.4297	-62.0	-39424.0	-45824.0	4.4375	4.0938
0.6914	1.0471	1000	140.0195	-476.0	-410.0	0.4355	-64.5	-41472.0	-47872.0	5.875	5.4062
0.6914	1.1518	1100	146.3645	-496.0	-430.0	0.4316	-67.5	-43264.0	-49920.0	5.7812	5.375
0.6914	1.2565	1200	151.9910	-516.0	-446.0	0.4336	-70.0	-44800.0	-51968.0	6.375	5.9375
0.6914	1.3613	1300	157.8106	-536.0	-462.0	0.4297	-73.0	-46592.0	-54016.0	7.0	6.5
0.6914	1.4660	1400	163.0493	-552.0	-478.0	0.4316	-75.5	-48128.0	-55552.0	7.3438	6.8125
0.6914	1.5707	1500	168.1114	-572.0	-494.0	0.4277	-77.5	-49664.0	-57344.0	7.2812	6.75
0.6914	1.6754	1600	172.7765	-588.0	-506.0	0.4316	-80.0	-50944.0	-58880.0	7.0625	6.5938
0.6914	1.7801	1700	176.9677	-600.0	-520.0	0.4395	-81.5	-52224.0	-60416.0	7.4688	6.9375
0.6914	1.8848	1800	180.6313	-612.0	-532.0	0.4355	-83.0	-53248.0	-61696.0	7.7812	7.25
0.6914	1.9895	1900	183.7843	-624.0	-540.0	0.4258	-84.5	-54272.0	-62720.0	7.625	7.125
0.6914	2.0942	2000	186.4619	-632.0	-548.0	0.4277	-86.0	-55040.0	-63744.0	7.6562	7.125
0.6914	2.1990	2100	188.7695	-640.0	-552.0	0.4258	-87.0	-55808.0	-64512.0	7.5938	7.125
0.6914	2.3037	2200	190.4722	-648.0	-560.0	0.4355	-87.5	-56320.0	-65024.0	7.5625	7.0625
0.6914	2.4084	2300	191.8555	-652.0	-564.0	0.4258	-88.5	-56576.0	-65536.0	7.5	7.0312
0.6914	2.5131	2400	192.9321	-656.0	-564.0	0.4258	-89.0	-56832.0	-66048.0	7.4375	6.9688
0.6914	2.6178	2500	193.6570	-656.0	-568.0	0.4258	-89.0	-57088.0	-66048.0	7.4688	7.0
0.6914	2.7225	2600	193.9604	-660.0	-568.0	0.4238	-89.5	-57344.0	-66048.0	7.5312	7.0625
0.6914	2.8272	2700	194.1360	-660.0	-568.0	0.4258	-89.5	-57344.0	-66048.0	7.5	7.0312
0.6914	2.9319	2800	194.2460	-660.0	-568.0	0.4277	-89.5	-57344.0	-66560.0	7.5	7.0
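
The table is tab-separated, so individual rows can be inspected with a few lines of Python. As a small sketch, the snippet below parses three rows copied from the table (keeping only the first four columns) and estimates the number of optimizer steps per epoch from the Epoch and Step columns. The variable names are illustrative.

```python
# Three rows copied from the results table:
# training loss, epoch, step, validation loss (tab-separated).
rows_text = """0.6914\t0.1047\t100\t0.6983
0.6914\t1.0471\t1000\t140.0195
0.6914\t2.9319\t2800\t194.2460"""

rows = []
for line in rows_text.splitlines():
    train_loss, epoch, step, val_loss = line.split("\t")
    rows.append({
        "train_loss": float(train_loss),
        "epoch": float(epoch),
        "step": int(step),
        "val_loss": float(val_loss),
    })

# Step divided by epoch fraction estimates the optimizer steps in one epoch.
steps_per_epoch = [row["step"] / row["epoch"] for row in rows]
# Row with the lowest validation loss among the parsed rows.
best = min(rows, key=lambda row: row["val_loss"])

print([round(s) for s in steps_per_epoch])
print(best["step"], best["val_loss"])
```

On these rows the estimate is about 955 steps per epoch, and the lowest validation loss is the first logged evaluation at step 100.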

Framework versions

Transformers 4.44.2
PyTorch 2.3.0
Datasets 2.21.0
Tokenizers 0.19.1

CharlesLi/OpenELM-1_1B-DPO-full-max-random-reward
