How was the reward model (RM) trained?

#2
by NickyNicky - opened

Hello, thank you for the model.
I am interested in knowing more about the training of this model.
I trained the 'Qwen2-0.5B-GRPO' model (21 steps) and saw that it uses the Gemma model for rewards.

https://huggingface.co/NickyNicky/Qwen2-0.5B-GRPO_final

How did you train it, and what were your scoring criteria for the prompts?
Is it possible to share the Colab notebook you used for training?
Is it possible to obtain the datasets?

Thank you so much.
