What are your benchmark settings for DeepSeek-R1-Distill-Qwen-32B?
#2, opened by AaronFeng753
For the results in the table, we used the SkyT1 evaluation code so we could compare directly with their model.
https://github.com/NovaSky-AI/SkyThought/tree/main/skythought/tools#generation-and-evaluation
You can find the full settings in their codebase.
However, note that the SkyT1 evaluation has some known issues; see https://github.com/NovaSky-AI/SkyThought/issues/38
Therefore, I would suggest looking into the evaluation framework we are using now: https://github.com/mlfoundations/Evalchemy
We have a blog post that compares the scores we get with Evalchemy against the scores publicly reported by the model developers: https://www.open-thoughts.ai/blog/measure