allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm
A suite of models trained using DPO and PPO across a wide variety (up to 14) of preference datasets. See https://arxiv.org/abs/2406.09279 for more!
Note Our overall best model: a 13B Tulu 2 model trained using PPO with a 70B reward model trained on UltraFeedback. We also release the value and reward models associated with this model; see the model card for details.
Note The datasets used for training PPO, DPO, and reward models in our paper.
Note The prompt sets used during PPO training in our paper. Below, see all our PPO-trained models!
Note Below are our PPO data ablations.
Note Below are our DPO data ablations.
Note Below are our reward models!
Note Below are our value models.
Note Below are our Llama 3 models: