Escape the Benchmark Trap: AutoBench – the Collective-LLM-as-a-Judge System for Evaluating AI models (ASI-Ready!)
What is more powerful: Artificial Intelligence or Collective Intelligence? How about Collective Artificial Intelligence? Introducing AutoBench, a “Collective-LLM-as-a-Judge” benchmark in which, given a set of LLMs to be ranked, those same LLMs rank each other collectively. Run a demo now on the AutoBench 1.0 Demo on Hugging Face Spaces.
LLM benchmarks are crucial for progress, but they're often expensive, static, and quickly become outdated. Human evaluations are slow and subjective, while existing automated benchmarks can be gamed. We need a better way to evaluate LLMs – one that's dynamic, cost-effective, and keeps pace with rapid advancements. That's why we built AutoBench.
AutoBench is a novel, fully automated LLM benchmark that uses the models themselves as judges. This Collective-LLM-as-a-Judge approach allows for continuous, scalable, and surprisingly affordable evaluation, achieving around 80% correlation with established standards like Chatbot Arena, MMLU, and the Artificial Analysis Quality Index (AAQI).
In this approach, a group of LLMs collaboratively evaluates questions and answers, leveraging their collective judgment to provide a robust, automated assessment of model performance, generating results that correlate strongly with established benchmarks. In addition, by dynamically generating questions (which are also ranked collectively), the system features strong resistance to "benchmark gaming".
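To make the idea concrete, here is a minimal sketch of one collective-judging round, assuming each judge returns a numeric grade and grades are combined with per-judge weights. Function and parameter names are illustrative only, not the actual AutoBench implementation.

```python
# Minimal sketch of one Collective-LLM-as-a-Judge round (illustrative only,
# not the actual AutoBench code). Every model answers the question, then every
# model grades every answer; grades are combined with per-judge weights.
def collective_rank(answers, grade, judge_weights):
    """Return a weighted-average grade per answering model.

    answers       -- {answering_model: answer_text}
    grade         -- grade(judge_model, answer_text) -> numeric score (e.g. 1-5)
    judge_weights -- {judge_model: weight}
    """
    total_weight = sum(judge_weights.values())
    return {
        model: sum(w * grade(judge, answer) for judge, w in judge_weights.items()) / total_weight
        for model, answer in answers.items()
    }

# Toy usage with a stubbed grader standing in for real judge-LLM calls:
ranks = collective_rank(
    answers={"model-a": "Paris", "model-b": "Lyon"},
    grade=lambda judge, answer: 5.0 if answer == "Paris" else 2.0,
    judge_weights={"model-a": 1.0, "model-b": 1.0},
)
print(ranks)  # {'model-a': 5.0, 'model-b': 2.0}
```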
Crucially, AutoBench is designed to be "ASI-ready". As LLMs evolve beyond human evaluation capabilities, AutoBench's Collective-LLM-as-a-Judge design will continue to provide relevant and meaningful comparisons.
Key Features and Advantages of AutoBench:
- Dynamic and Adaptive: Continuously generated questions prevent benchmark gaming and keep the evaluation relevant as LLMs evolve.
- Cost-Effective: A 20-model run costs under $100 (versus thousands, or even tens of thousands, of dollars for human-based evaluations) and returns a stable benchmark that correlates strongly with common benchmarks such as MMLU (above 75%) and Chatbot Arena (above 80%).
- Scalable: The Collective-LLM-as-a-Judge approach makes it easy to evaluate a large and continually updated set of models, aided by a dynamic weighting system that accounts for each model's performance (see the sketch after this list).
- Transparent Bias: While we acknowledge that bias is to some extent intrinsic to any benchmarking system, AutoBench embraces model bias as a feature, as the collective approach reduces individual model quirks and provides a perspective that reflects the "collective intelligence" of the current LLM ecosystem.
- ASI-Ready: As Artificial Superintelligence (ASI) emerges, AutoBench's reliance on LLMs as judges ensures it can scale with increasingly advanced models, maintaining relevance even when human evaluation becomes impractical.
- Nuanced Insight: AutoBench provides a nuanced understanding of LLM performance across a wide range of topics, from math and logic to history and creative writing. Its dynamic difficulty levels ensure fair and consistent evaluation.
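The dynamic weighting mentioned in the Scalable point can be pictured as follows. This is a hypothetical illustration of the mechanism; the actual AutoBench update rule may differ.

```python
# Hypothetical illustration of dynamic judge weighting: models that rank
# higher as answerers get proportionally more influence as judges in the
# next iteration. The real AutoBench weighting rule may differ.
def update_judge_weights(cumulative_ranks):
    """Normalize each model's cumulative rank into a judge weight (sums to 1)."""
    total = sum(cumulative_ranks.values())
    return {model: rank / total for model, rank in cumulative_ranks.items()}

weights = update_judge_weights({"gpt-4o-2024-11-20": 4.43, "google/gemma-2-9b-it": 4.01})
print(weights)  # stronger answerers carry slightly more judging weight
```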
How Does it Perform?
AutoBench achieves impressive correlations with widely recognized generalist LLM benchmarks:
- 83% correlation with Chatbot Arena: Indicating strong alignment with human preference-based evaluations of conversational ability.
- 75% correlation with MMLU: Demonstrating a significant correlation with a benchmark focused on massive multitask language understanding.
- 79% correlation with Artificial Analysis Intelligence Index (AAQI): Showing alignment with a benchmark assessing broader AI capabilities.
These are the results of our longest run: 20 models, 267 questions, 5,340 answers, and 112k rankings, completed in about 7 hours for under $100. The table below compares AutoBench (AB) scores with Chatbot Arena (CBA), Measuring Massive Multitask Language Understanding (MMLU), and the Artificial Analysis Intelligence Index (AAQI); the bottom row reports the correlation between AutoBench 1.0 and each of the other benchmarks.
| Model | AB score | CBA score | MMLU score | AAQI score |
|---|---|---|---|---|
| gpt-4o-2024-11-20 | 4.43 | 1365 | 86 | 75 |
| gpt-4o-mini-2024-07-18 | 4.28 | 1273 | 82 | 73 |
| gemini-2.0-flash-001 | 4.37 | 1357 | | |
| gemini-2.0-flash-lite-preview-02-05 | 4.29 | 1306 | 85 | 79 |
| gemini-1.5-flash-002 | 4.26 | 1271 | 81 | 74 |
| google/gemma-2-27b-it | 4.07 | 1220 | 77 | 61 |
| google/gemma-2-9b-it | 4.01 | 1192 | 73 | 55 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 4.25 | 1256 | 86 | 74 |
| meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | 4.14 | 1248 | 84 | 67 |
| meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo-128K | 3.78 | 1176 | 71 | 54 |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 4.36 | 1269 | 86 | 72 |
| deepseek-ai/DeepSeek-V3 | 4.27 | 1317 | 87 | 79 |
| deepseek-ai/deepseek-llm-67b-chat | 3.94 | 1077 | 72 | 47 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 4.04 | 1114 | 63 | 41 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4.11 | 1148 | 76 | 61 |
| Qwen/Qwen2.5-72B-Instruct-Turbo | 4.33 | 1257 | 86 | 77 |
| Qwen/Qwen2-VL-72B-Instruct | 4.00 | 1187 | 83 | 68 |
| claude-3-haiku-20240307 | 4.09 | 1179 | 71 | 55 |
| claude-3-5-haiku-20241022 | 4.25 | 1236 | 81 | 68 |
| openai-gpt-3.5-turbo-0613 | 3.68 | 1117 | | |
| correlation vs. AutoBench 1.0 | | 83.14% | 75.09% | 79.19% |
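For readers who want to recompute the bottom row, the correlations can be derived directly from the score columns. The sketch below assumes a Pearson correlation over the models scored by both benchmarks (the exact measure AutoBench uses is described in the methodology document and may differ), and uses only a subset of the table for brevity.

```python
# Sketch of recomputing the bottom-row correlations from the table, assuming
# a Pearson correlation over models scored by both benchmarks. Only a few
# rows are shown for brevity; the exact measure AutoBench uses may differ.
from scipy.stats import pearsonr

ab_scores  = [4.43, 4.28, 4.26, 4.07, 4.01]   # AutoBench scores (subset of rows)
cba_scores = [1365, 1273, 1271, 1220, 1192]   # Chatbot Arena scores, same models

r, _ = pearsonr(ab_scores, cba_scores)
print(f"correlation vs. AutoBench 1.0 (subset): {r:.2%}")
```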
And, of course, benchmark rankings can also be drawn per topic (math, logic, coding, history, science, creative writing, etc.).
AutoBench also provides detailed information about answer speeds. The graph below plots average rank against answer time per model, revealing efficiency trade-offs—faster models don't always rank highest. Note that DeepSeek V3's speed varied due to provider demand spikes during this run, highlighting real-world testing challenges (we tested several API providers).
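If you want to regenerate this view from your own run, a plot along these lines is enough. The CSV path and column names below are hypothetical placeholders; adapt them to whatever your results export contains.

```python
# Sketch of the rank-vs-latency plot described above. The CSV path and column
# names ("model", "avg_rank", "avg_answer_time_s") are hypothetical; adapt
# them to the output of your own AutoBench run.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("autobench_results.csv")
plt.scatter(df["avg_answer_time_s"], df["avg_rank"])
for _, row in df.iterrows():
    plt.annotate(row["model"], (row["avg_answer_time_s"], row["avg_rank"]), fontsize=7)
plt.xlabel("Average answer time (s)")
plt.ylabel("Average rank")
plt.title("AutoBench: quality vs. speed")
plt.tight_layout()
plt.show()
```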
Ready to Test Your Own Model?
AutoBench is designed to be easily extensible. You can add your own LLM in just a few steps:
- Load the pre-computed ranks and weights from our latest run.
- Add your model to the `configs` file, specifying the correct API provider (Anthropic, Grok, Nebius, OpenAI, Together AI, or Google's Vertex AI).
- Run AutoBench for at least 100 iterations to get reliable results.
After that many iterations, the rank error should be well below 1%. The current general config should accommodate most cases (e.g., context windows are set long enough for the task, but not so long that costs balloon).
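As an illustration only, a new model entry might look something like the sketch below; the actual schema is defined by the `configs` file in the AutoBench repository, so the field names here are hypothetical placeholders.

```python
# Purely illustrative model entry; the real schema is defined by the configs
# file in the AutoBench repository, so these field names are placeholders.
NEW_MODEL = {
    "name": "my-org/my-new-llm",     # hypothetical model identifier
    "provider": "together",          # anthropic | grok | nebius | openai | together | vertex
    "max_context_tokens": 8192,      # long enough for question + answer, short enough to keep costs down
    "temperature": 0.7,
}
```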
Detailed instructions and the full codebase are available on Hugging Face AutoBench Repository.
Available as a demo on Hugging Face Spaces
Want to try it right away? Head to our AutoBench 1.0 Demo Space on Hugging Face. It is a scaled-down demo that runs on the Hugging Face Inference API, with only 7 models available. Simply select the models you want to rank and the topics to rank them against, then press "Run Benchmark".
While AutoBench excels in scalability and cost, its reliance on LLM judges introduces biases reflective of current models. We're actively refining this through community input and future updates. For more on this, please read the Detailed Methodology Document.
Get Involved!
AutoBench is a step towards more robust, scalable, and future-proof LLM evaluation. We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation!
- Explore the code and data: Hugging Face AutoBench 1.0 Repository
- Try our Demo on Spaces: AutoBench 1.0 Demo
- Read the detailed methodology: Detailed Methodology Document
- Join the discussion: Hugging Face AutoBench Community Discussion
- Contribute: Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm; submit pull requests or issues via the Hugging Face Repo.