---
library_name: transformers
tags: []
---

# Model Card for PPO-M Calibrated Reward Model

**PPO-M** (PPO with Calibrated Reward Modeling) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained Large Language Models. We calibrate the reward modeling process by augmenting the binary pairwise ranking dataset with explicit confidence scores, and encourage the reward model to align confidence levels with response quality. Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.

## Model Details

### Model Description

We train a calibrated reward model from [OpenRLHF/Llama-3-8b-rm-mixture](https://huggingface.co./OpenRLHF/Llama-3-8b-rm-mixture) on our [calibration_preference_mixture_final-v0.1](https://huggingface.co./datasets/HINT-lab/calibration_preference_mixture_final-v0.1) dataset.

- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- **Finetuned from model:** [OpenRLHF/Llama-3-8b-rm-mixture](https://huggingface.co./OpenRLHF/Llama-3-8b-rm-mixture)

### Model Sources

- **Repository:** [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)
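
Below is a minimal sketch of scoring a single (prompt, response) pair with the reward model. It assumes the checkpoint loads through `AutoModelForSequenceClassification` with a single reward logit and that the tokenizer ships the Llama-3 chat template; the Hub ID shown is a placeholder. The exact loading and scoring code used in our experiments is in the repository linked above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "HINT-lab/ppo-m-calibrated-reward-model"  # placeholder: replace with this model's Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,              # single scalar reward head (assumed)
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Format the conversation with the chat template and score the assistant reply.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Higher scores indicate responses the calibrated reward model prefers.
    reward = model(input_ids).logits[0, 0].item()

print(f"reward: {reward:.4f}")
```

If loading with `AutoModelForSequenceClassification` does not match the saved reward head, use the scoring scripts in the repository instead.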