What are your benchmark settings for DeepSeek-R1-Distill-Qwen-32B?

#2
by AaronFeng753 - opened

The LCB hard score you got is significantly higher than FuseAI's. Could you please disclose your benchmark settings for it? It seems you have found the best settings for this model. Please share the details as well, such as the system prompt, temperature, top-p, top-k, and repeat penalty. Thank you so much!

[Attached screenshot: Snipaste_2025-01-30_15-08-44.png]

Bespoke Labs org

For the results in the table, we used the SkyT1 evaluation code so we could compare directly with their model.

https://github.com/NovaSky-AI/SkyThought/tree/main/skythought/tools#generation-and-evaluation

You can find the full settings in their codebase.

Bespoke Labs org
edited 3 days ago

However, note that there are some known issues with the SkyT1 evaluation; see https://github.com/NovaSky-AI/SkyThought/issues/38

Therefore, I would suggest looking into the evaluation framework we are using now: https://github.com/mlfoundations/Evalchemy

We have a blog post that compares the scores we get with Evalchemy against the scores publicly reported by the model developers: https://www.open-thoughts.ai/blog/measure
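If you want to run the same kind of sanity check on your own numbers, here is a minimal sketch of comparing reproduced scores with publicly reported ones. It is not taken from Evalchemy or the blog post: the names `ScoreComparison` and `compare_scores` are hypothetical, and the benchmark names and scores in the usage section are placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreComparison:
    """One benchmark with a locally reproduced and a publicly reported score."""
    benchmark: str
    reproduced: float
    reported: float

    @property
    def delta(self) -> float:
        # Positive means the local run scored higher than the reported number.
        return self.reproduced - self.reported


def compare_scores(reproduced: dict[str, float],
                   reported: dict[str, float]) -> list[ScoreComparison]:
    """Pair up benchmarks present in both dicts, sorted by benchmark name."""
    shared = sorted(reproduced.keys() & reported.keys())
    return [ScoreComparison(name, reproduced[name], reported[name])
            for name in shared]


if __name__ == "__main__":
    # Placeholder values for illustration only; these are NOT real scores.
    my_runs = {"benchmark_a": 50.0, "benchmark_b": 30.0}
    published = {"benchmark_a": 48.0, "benchmark_b": 31.0, "benchmark_c": 70.0}
    for row in compare_scores(my_runs, published):
        print(f"{row.benchmark}: reproduced={row.reproduced:.1f} "
              f"reported={row.reported:.1f} delta={row.delta:+.1f}")
```

Only benchmarks present in both dictionaries are compared, so a score that you did not reproduce is skipped rather than counted as zero. A large delta is usually a hint to check the generation settings (system prompt, temperature, top-p, top-k, repeat penalty) before concluding anything about the model.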
