Question: same model with very different scores

#904
by Yuma42 - opened

Hello, the leaderboard lists the same model twice (both entries link to the same model page), but the scores are very different. It's mlabonne/NeuralDaredevil-8B-abliterated, which scores 27.01 and 21.5.

Can someone explain? If I had to guess, maybe it was evaluated once with the chat template and once without?

The only difference I can see is that one is bfloat16 and the other is float16. My guess is there's a bug in the IFEval evaluation with bfloat16 (41 vs. 75), since the other evals match up.

Open LLM Leaderboard org

Hi @Yuma42 ,

This means the model was evaluated twice, in bf16 and f16 precision, so @phil111 is right. Please check out my screenshot, where I clicked to show the "Precision" column. As for the low IFEval score, it isn't a bug: this model doesn't use the chat template in the bfloat16 run, which causes the low IFEval score, while, as you can see in the request file, the float16 version has "use_chat_template: True".
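For illustration, here is a minimal sketch of how the two request-file entries might differ. The field names and values are assumed for this example; only "use_chat_template" is quoted from the thread above.

```python
# Two hypothetical request-file entries for the same model.
# Field names are illustrative, not the leaderboard's exact schema.
requests = [
    {"model": "mlabonne/NeuralDaredevil-8B-abliterated",
     "precision": "bfloat16", "use_chat_template": False},
    {"model": "mlabonne/NeuralDaredevil-8B-abliterated",
     "precision": "float16", "use_chat_template": True},
]

def diff_entries(a, b):
    """Return the fields whose values differ between two request entries."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

# The two runs differ only in precision and chat-template usage,
# which accounts for the IFEval gap (41 vs. 75).
print(diff_entries(requests[0], requests[1]))
```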

I'm closing this discussion; please feel free to open a new one if you have any questions!

Screenshot 2024-08-30 at 13.32.07.png

alozowski changed discussion status to closed
