Temporary removal of models for investigation
Hey,
I have recently submitted a series of high-scoring 7B models, and based on feedback from users there seem to be some issues with them.
If possible, I would like to ask that their scores be temporarily suspended from the top of the leaderboard until further investigation confirms that there was no benchmark contamination or overfitting on my side.
Thank you for your time!
Hi @yam-peleg !
Thank you for paying attention to this feedback :)
No problem removing them from the display. Can you follow the steps in the FAQ and provide us with links to all the relevant request files?
I deleted my examples showing how this and other Mistrals flooding the top of the leaderboard are nowhere close to earning scores of around 77, or even 70, because the battle is lost and I don't see a solution.
The amount of deliberate or unintentional contamination, and of fine-tuning tricks that exploit the weaknesses of standardized tests, such as fine-tuning extreme stubbornness to increase TruthfulQA scores, has resulted in a needle-in-a-haystack situation.
Mistral scores no longer remotely correlate with true performance. The top scoring Mistral 7b I've found (Trinity v1), scores ~67 relative to the scores of other more powerful LLMs on the leaderboard, but now they're hitting 77.
That's notably higher than Mixtral, GPT3.5 and other LLMs that do a far better job responding to my complex and diverse prompts, solving problems and so on. The top Mistrals are now dim-witted, excessively stubborn empty shells created using fine-tuning tricks.
Hi @Phil337 ,
Thanks for your comments.
This highlights precisely why it's important to:
- evaluate pretrained models in their own category
- have a separate view for moerges/merges
Note: I think it would be interesting next time to open another discussion with your points, as this one is a "support" issue rather than a "discussion" issue. Edit: saw that you just did, looking forward to the discussion!
@yam-peleg Sorry to drop in here again, but can you read my last post in the following discussion? Does my theory make sense?
In summary, Experiment26 only becomes stubborn in mistake-ridden areas, like solving logic problems, celebrity gossip, and who sang what song. This forces GPT4 to make corrections (synthetic data) that the far less knowledgeable and smaller Mistral 7b can't implement, resulting in a reliable pattern of egregious factual and logical errors (he was never married in that TV show), followed by fabricated substantiation (it was only part of fan fiction).
https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard/discussions/626
Hi @yam-peleg,
Closing this issue for inactivity, but feel free to reopen once you provide us with the relevant request files for your models as requested above!
Did the investigation conclude there was no issue with these experimental 7B models? There are quite a lot of them, and now we have other models built on top of them (auto-merges and fine-tunes). I just wanted to make sure they are safe to use in my merges and fine-tunes.