Did you change the way you run evals today?
Hello, we saw our new model got evaluated yesterday:
https://huggingface.co./datasets/open-llm-leaderboard/results/blob/main/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/results_2023-08-09T10%3A54%3A28.159442.json
Now, it appears it has been re-evaluated in the last few hours:
https://huggingface.co./datasets/open-llm-leaderboard/results/blob/main/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/results_2023-08-09T19%3A53%3A44.921082.json
Everything is exactly the same in the config_general section of the output, but the results are dramatically different (and the overall score is much lower on the new eval, which the leaderboard appears to have been updated to use).
We noted that the original eval had "total_evaluation_time_secondes": "6305.85427236557", whereas the new eval appears to have taken over 5 times longer to run.
The original eval was much closer to our internal evals which were run based on your published methods. Can you please explain what has changed today?
Hi, I just checked the requests dataset, and your model has actually been submitted 3 times: once in float16, once in bfloat16, and once in 4bit (here). We ran all three evaluations, and I guess the last one (4bit, which is way slower because of the quantization operations) overrode the other two in the results file.
However, it should be saying "4bit" there, so I'll check why it doesn't. Thank you very much for paying attention and pointing this out!
Thanks for that! We had a suspicion that might have been the case. None of these requests were submitted by our team. For reference, the model was trained in bfloat16, but the float16 results are similar at least.
Might I suggest something like this to avoid having evals with the wrong precision parameters pollute the results...
- Look in config.json for the torch_dtype entry and use that (e.g. https://huggingface.co./Open-Orca/OpenOrcaxOpenChat-Preview2-13B/blob/main/config.json#L19 )
- Look for a specific named file with nothing but the intended precision for evals (e.g. filename: eval_dtype.cfg, contents: bfloat16)
This would save evals time running with the wrong parameters and prevent spurious results being posted to the leaderboard (whether by ignorance, accident, or malice).
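To make the two suggestions above concrete, here is a minimal sketch of how a precision lookup could work against a local copy of a model repo. This is not how the leaderboard works; the function name resolve_eval_dtype, the set of accepted dtypes, and the rule that eval_dtype.cfg takes precedence over config.json are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional

# Assumption: the precisions an eval harness would accept from the model authors.
ALLOWED_DTYPES = {"float16", "bfloat16", "float32"}


def resolve_eval_dtype(model_dir: str) -> Optional[str]:
    """Return the precision the model authors intend for evals, or None if unknown."""
    root = Path(model_dir)

    # Suggestion 2: a dedicated file containing nothing but the intended precision.
    # Assumption: when present and valid, it wins over config.json.
    override = root / "eval_dtype.cfg"
    if override.is_file():
        value = override.read_text(encoding="utf-8").strip()
        if value in ALLOWED_DTYPES:
            return value

    # Suggestion 1: fall back to the torch_dtype entry in config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        value = config.get("torch_dtype")
        if isinstance(value, str) and value in ALLOWED_DTYPES:
            return value

    # Nothing usable found: the caller decides what to do (e.g. use the requested precision).
    return None
```

A submission pipeline could compare the returned value with the precision in the request and flag or reject a mismatch, instead of running the eval and publishing the result.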
More weird stuff with average
Hi @felixz,
You'll notice if you display the model sha that this model appears 3 times because it's been submitted with three different model shas.
I would be grateful if you could create a dedicated issue next time.
@bleysg We actually are OK with people submitting bfloat16 models for evaluation in 4bit, for example, especially for bigger models: not everybody has the hardware to run a 70B model, and it's very interesting for a lot of people to know how much performance they lose when quantizing models. That's why we added the option.
However, I updated the leaderboard, so that models of different precisions are now each on their own row, to avoid the problems you had earlier with your model, which is now back at the (almost) top of its category :)
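As a rough illustration of why separate rows avoid the earlier problem: if results are indexed by model name alone, a later run replaces an earlier one, whereas indexing by (model, precision) keeps one row per precision. The sketch below is not the leaderboard's code; the field names "model", "precision", and "score", the function names, and the sample values are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def index_by_model(runs: List[dict]) -> Dict[str, dict]:
    """Index runs by model name only: a later run silently replaces an earlier one."""
    rows: Dict[str, dict] = {}
    for run in runs:
        rows[run["model"]] = run
    return rows


def index_by_model_and_precision(runs: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Index runs by (model, precision): each precision keeps its own row."""
    rows: Dict[Tuple[str, str], dict] = {}
    for run in runs:
        rows[(run["model"], run["precision"])] = run
    return rows


if __name__ == "__main__":
    # Placeholder data, not real leaderboard results.
    runs = [
        {"model": "org/model", "precision": "float16", "score": 1.0},
        {"model": "org/model", "precision": "bfloat16", "score": 2.0},
        {"model": "org/model", "precision": "4bit", "score": 3.0},
    ]
    print(len(index_by_model(runs)))                # 1 row: only the last run survives
    print(len(index_by_model_and_precision(runs)))  # 3 rows: one per precision
```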
Thanks! That works too :) I noticed that our model appears to be the only one aside from GPT2 with a 4bit result on the board currently. Is this perhaps due to a longstanding issue with 4bit evals getting miscategorized?
It is highly possible, yes!
I'll have to do a full pass on matching info in the requests file with info in the results file.
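One way such a pass could look, as a sketch only: count the precisions requested per model, count the precisions reported per model, and list the models where the two disagree. The record layout (dicts with "model" and "precision" keys) and the function name precision_mismatches are assumptions; the real requests and results files may be structured differently.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def precision_mismatches(requests: List[dict], results: List[dict]) -> Dict[str, dict]:
    """Return models whose requested precisions differ from the reported ones."""
    requested: Dict[str, Counter] = defaultdict(Counter)
    reported: Dict[str, Counter] = defaultdict(Counter)

    # Count how many times each precision was requested for each model.
    for req in requests:
        requested[req["model"]][req["precision"]] += 1

    # Count how many times each precision appears in the results for each model.
    for res in results:
        reported[res["model"]][res["precision"]] += 1

    mismatches: Dict[str, dict] = {}
    for model in sorted(set(requested) | set(reported)):
        if requested[model] != reported[model]:
            mismatches[model] = {
                "requested": dict(requested[model]),
                "reported": dict(reported[model]),
            }
    return mismatches
```

A 4bit request whose result was recorded under another precision would show up here as a model with one 4bit entry on the requested side and none on the reported side.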