Basic question: How would you reproduce the training of this model?
Hi there, I have a somewhat basic question: I am interested in training my own reward model (probably just on top of Llama 3.1 due to resource constraints) on your INF-ORM-Preference-Magnitude-80K dataset with some additional conversations that are more domain-specific.
Would you mind providing a pointer either to your training code or to a library that would enable me to replicate this work? For example, could I get similar results with the RewardTrainer from trl, or would I need some other custom training library?
Hi, we used Megatron-LM for training, with internal modifications to speed it up. That modified framework is private to the company, but you can still use an open-source training framework; I believe this will not affect the final performance.
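For reference, here is a minimal sketch of how such a run could be set up with trl's RewardTrainer. The repository id, dataset column handling, and hyperparameters below are illustrative assumptions rather than the recipe used for the released model, and any preference-magnitude column in the dataset would simply be ignored by the default pairwise (Bradley–Terry) loss here.

```python
# Minimal sketch, assuming the dataset exposes "chosen"/"rejected" columns in a
# format RewardTrainer can consume; names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "meta-llama/Llama-3.1-8B-Instruct"  # any Llama 3.1 checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
# A reward model is a sequence classifier with a single scalar output head.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

# Llama tokenizers ship without a pad token; reward training needs one.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Preference data; repo id is assumed here. Domain-specific conversations in the
# same format could be mixed in via datasets.concatenate_datasets.
dataset = load_dataset("infly/INF-ORM-Preference-Magnitude-80K", split="train")

training_args = RewardConfig(
    output_dir="llama3.1-reward-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    max_length=4096,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,  # older trl versions use tokenizer= instead
    train_dataset=dataset,
)
trainer.train()
```

Depending on your trl version and the dataset's exact schema, you may need to remap or preprocess the columns before passing them to the trainer.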
Thank you for the response!