AtlaAI/judge-arena · Which models do you want to see on here?

kaikaidai

Atla org Nov 19, 2024

We started with the following models, as they are the ones we've seen most commonly used in eval pipelines:

OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
Google (Gemma 2 9B / 27B)
Mistral (Instruct v0.3 7B, Instruct v0.1 7B)

What models would you be curious to see on here next?

CombinHorizon

Nov 20, 2024 • edited Dec 18, 2024

What about these models:

Microsoft (Phi-3-medium-4k-instruct 14B)
Alibaba (Qwen 2.5 32B, 14B); they have EQbench scores closer to Qwen 2.5 72B than to 7B
Upstage (solar-pro-preview-instruct 22B)
Mistral (Mistral-Large-Instruct-2407 123B)

(As a reference for which models to choose.) Other than some of the common benchmarks,

here's one [benchmark] that is related to judging:

judgemark

But how are the judging scores extracted: by number, words, or something else? (see https://arxiv.org/abs/2305.14975)

jhoareau

Nov 20, 2024

Gemini models.

davidberenstein1957

Nov 20, 2024

https://www.flow-ai.com/judge? I believe it fits the criteria and seems like an interesting smaller competitor based on their pitch in the release blog.

bergr7f

Nov 20, 2024

hey! great initiative :) Would love to see a small model like Flow-Judge-v0.1 here! Happy to support with the integration if needed.

kaikaidai

Atla org Nov 20, 2024

What about these models:

Microsoft (Phi-3-medium-4k-instruct 14B)

Alibaba (Qwen 2.5 32B, 14B)

Upstage (solar-pro-preview-instruct 22B)

Mistral (Mistral-Large-Instruct-2407 123B)

(As a reference for which models to choose.) Other than some common benchmarks:

open_llm_leaderboard

eqbench

here's one [benchmark] that is related to judging:

judgemark

But how are the judging scores extracted: by number, words, or something else? (see https://arxiv.org/abs/2305.14975)

Good shouts! I'm curious to see how those Qwen models would do given that the 2.5 7B is doing pretty well. And those benchmarks are very interesting, evaluating writing quality is a seriously tough task...

The judge score and critique are extracted from a JSON output {"feedback": "", "result": } similar to the Lynx paper
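
For illustration only, here is a minimal sketch of how a reply in that JSON shape could be parsed. It is not Judge Arena's actual code: the function name `parse_judge_output`, the tolerance for extra text around the JSON, and the choice to return `None` on failure are all assumptions made for this example.

```python
import json
import re


def parse_judge_output(raw: str):
    """Return (feedback, result) from a judge reply, or None if it cannot be parsed.

    Hypothetical helper: assumes the reply contains one JSON object with
    "feedback" and "result" keys, possibly surrounded by extra prose.
    """
    # Take everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject replies that are valid JSON but lack the expected keys.
    if not isinstance(data, dict) or "feedback" not in data or "result" not in data:
        return None
    return str(data["feedback"]), data["result"]


if __name__ == "__main__":
    reply = 'Here is my verdict: {"feedback": "Clear and correct.", "result": 4}'
    print(parse_judge_output(reply))
```

Returning `None` rather than raising lets a pipeline count and retry malformed judge replies instead of crashing on them; whether that is the right trade-off depends on the setup.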

kaikaidai

Atla org Nov 20, 2024

👀 will add flow judge in our next update, I'm super excited to see how a dedicated 3.8B model does

ChuckMcSneed

Nov 20, 2024

Add Command-r and Command-r+, both old and new. They were the least positively biased in my experience.

bittersweet

Nov 21, 2024 • edited Nov 21, 2024

Great work! We have been looking forward to an arena like this for judge models!

How about adding the CompassJudger series (https://github.com/open-compass/CompassJudger),
which reached top performance among generative models on
RewardBench (https://huggingface.co./spaces/allenai/reward-bench),
JudgerBench (https://huggingface.co./spaces/opencompass/judgerbench_leaderboard), and
JudgeBench (https://huggingface.co./spaces/ScalerLab/JudgeBench)?
It can also be applied as a judge model to many subjective evaluation datasets, for example ArenaHard: https://github.com/lmarena/arena-hard-auto/issues/49

kaikaidai

Atla org Nov 28, 2024

New models live on Judge Arena!

Prometheus-7b-v2, Command-R, Command-R+ models are now in the race🚀

We’ll get other specialised judge models on here soon.

++
@bittersweet do you have an email address I can reach out to? I’ve tried to get in touch RE getting CompassJudger on here

bittersweet

Nov 29, 2024

Just use this: [email protected]

softclone

Dec 14, 2024

o1
Hermes3-405B (I think Lambda is still offering this for free)
Athene-v2
Deepseek-2.5
the OG GPT-4-0314
Grok-2

pszemraj

Dec 19, 2024

https://huggingface.co./meta-llama/Llama-3.3-70B-Instruct pls

kaikaidai

Atla org Jan 15

Llama 3.3 70B, QwQ 32B Preview, Flow-Judge are all now on Judge Arena!

bergr7f

Jan 16

@kaikaidai would be cool to have phi-4 on the leaderboard. It seems to be a strong judge based on our internal testing.