What are your benchmark settings for DeepSeek-R1-Distill-Qwen-32B?

by AaronFeng753 - opened Jan 30

Jan 30

The LCB hard score you got is significantly higher than FuseAI's. Could you please disclose your benchmark settings for it? You seem to have found the best settings for this model. Please share the details as well, such as system prompt, temperature, top-p and top-k, and repeat penalty. Thank you so much!

ryanmarten

Bespoke Labs org Jan 31

For the results in the table, we used the SkyT1 evaluation code so we could compare directly with their model.

https://github.com/NovaSky-AI/SkyThought/tree/main/skythought/tools#generation-and-evaluation

You can find the full settings in their codebase.

ryanmarten

Bespoke Labs org Jan 31

edited Jan 31

However, note that there are some issues with the SkyT1 evaluation: https://github.com/NovaSky-AI/SkyThought/issues/38

Therefore, I would suggest looking into the evaluation framework we are using now: https://github.com/mlfoundations/Evalchemy

We have a blog post that compares the scores we get with Evalchemy against the scores publicly reported by the model developer: https://www.open-thoughts.ai/blog/measure
