Spaces: mlabonne/Yet_Another_LLM_Leaderboard

Nice Leaderboard :)

by Venkman42 - opened Jan 9

Jan 9

Nice to see some more leaderboards. I feel like you can't really trust the Open LLM Leaderboard at this point, and they don't add any phi-2 models except the Microsoft one because of remote code.
Could you add the following models?
Phi-2-dpo
Openchat v1+v2
Mistral v1+v2
Starling Alpha

I would be really interested in how phi-2-dpo stacks up against dolphin phi, and the other models would be great reference points since they have been very popular.

mlabonne

Owner Jan 9

Thanks! I added Starling Alpha and Openchat thanks to @gblazex working on uploading more phi-2 models.

Venkman42

Jan 9

Thanks, don't forget the new phixtral models 😉

Oh and btw, gobruins 2.1.1 was flagged as contaminated on the Open LLM Leaderboard because it contains data for TruthfulQA:
https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657e6a221e3e9c41a4a8ae23

mlabonne

Owner Jan 9

Haha phixtral is cooking! Performance won't be that great with this version but it's a start.

Yes, I noticed that. I think @gblazex wanted to compare the performance on the Open LLM Leaderboard vs. Nous benchmark suite. I'll probably remove it.

gblazex

Jan 10

I'll run these:
Openchat-1210 ("v2")
Mistral v1+v2

gblazex

Jan 10

•

edited Jan 10

Yes, I noticed that. I think @gblazex wanted to compare the performance on the Open LLM Leaderboard vs. Nous benchmark suite. I'll probably remove it.

Yes, TruthfulQA is part of Nous. I wanted to see how it does on the rest.
No need for it to be on this leaderboard.

(This was the only flagged one that was interesting to me, because a relatively large number of people liked the model card. It actually does well on other benchmarks like AGIEval.)

b-mc2

Jan 10

👍 Cool leaderboard!

I'm glad to see dolphin-2_6-phi-2 up here. It feels capable, and it's cool to see it compared to phi-2. 3B models are pretty underrepresented on the Open LLM board, with a big gap between phi-2 and MiniChat-2, so I'd also like to request stabilityai/stablelm-zephyr-3b. It isn't on the other board due to remote code. It doesn't feel as good as phi-2, but it is decent.

Also Weyaxi/OpenHermes-2.5-neural-chat-v3-3-Slerp is really good and my go-to 7B so I'd like to see how it performs here too.

Venkman42

Jan 10

New Openchat model just dropped, would be a great addition ;)
openchat/openchat-3.5-0106

Venkman42

Jun 5

@Venkman42

@mlabonne Hey, could you please add Microsoft's Phi-3-instruct?

I did this for you, results can be found here: https://huggingface.co./spaces/CultriX/Alt_LLM_LeaderBoard?logs=build

Or in more detail: https://gist.github.com/CultriX-Github/63952ac9317c80e241c0337c31e53a13

Thanks :)

CultriX

Jun 16

•

edited Jun 16

Sorry @Venkman42, I forgot to add it. Thanks @CultriX, I'm forking your gist :)

Don't tell anyone but I think I forked about every single evaluation you ever ran on your leaderboard so I didn't have to run it myself. So all good :p!

Edit: Hey btw, what happened to your leaderboard? It looks like all your scores are now way lower than they used to be (compare your board with mine). Did you change something? The tests that are run, maybe? (Also, you have quite a few models added twice, which is absolutely not a big deal except it ruins the pretty graphs at the bottom, which for me personally actually is a little bit of a big deal, but that's probably just me lol.)

mlabonne

Owner Jun 16

Yeah I removed TruthfulQA from the Average score to make it more accurate. Where do you see duplicated models, is it in Automerger's version?

CultriX

Jun 16

No, on this one. And ahh, check, that explains a lot! Any reason why you took that one out in particular?

mlabonne

Owner Jun 16

Ah it's because of the "Other" category, just fixed it thanks.

TruthfulQA is an awful benchmark. I believe HF is also thinking of removing it.

CultriX

Jun 17

I liked your idea of being able to remove certain benchmarks or tests and have the table recalculate a new filtered average. But instead of making the choice for the user, I decided to implement a way for people to choose which benchmarks they want to use in the table. Take a look and feel free to copy the idea if you like it :)!

https://huggingface.co./spaces/CultriX/Alt_LLM_LeaderBoard
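
For anyone wondering what "recalculate a filtered average" looks like in practice, here is a rough Python sketch. It is not the code behind either leaderboard. The model names and scores are made up for illustration, the helper names `filtered_average` and `rank` are hypothetical, and the four benchmark names assume the usual Nous suite (AGIEval, GPT4All, TruthfulQA, Bigbench).

```python
from statistics import mean

# Hypothetical scores for illustration only; these are NOT real leaderboard results.
scores = {
    "model-a": {"AGIEval": 40.0, "GPT4All": 70.0, "TruthfulQA": 60.0, "Bigbench": 42.0},
    "model-b": {"AGIEval": 44.0, "GPT4All": 72.0, "TruthfulQA": 48.0, "Bigbench": 40.0},
}

ALL_BENCHMARKS = ["AGIEval", "GPT4All", "TruthfulQA", "Bigbench"]


def filtered_average(model_scores, selected):
    """Average only the benchmarks the user selected."""
    values = [model_scores[name] for name in selected if name in model_scores]
    if not values:
        raise ValueError("no selected benchmark has a score for this model")
    return round(mean(values), 2)


def rank(all_scores, selected):
    """Return (model, average) pairs sorted from best to worst."""
    rows = [(model, filtered_average(s, selected)) for model, s in all_scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Average over everything, then with TruthfulQA deselected.
with_tqa = rank(scores, ALL_BENCHMARKS)
without_tqa = rank(scores, [b for b in ALL_BENCHMARKS if b != "TruthfulQA"])
print(with_tqa)
print(without_tqa)
```

With these made-up numbers, dropping TruthfulQA changes both averages and flips the order of the two models, which is the kind of shift you would expect to see when comparing a board that includes the benchmark with one that does not.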

mlabonne

Owner Jun 17

Yeah that's a good idea, might steal it haha

CultriX

Jun 17

•

edited Jun 17

might

edit: nvm I stole your entire leaderboard so I really can't say sh*t lol
