metadata

title: AutoBench 1.0 Demo
emoji: 🐠
colorFrom: red
colorTo: yellow
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
short_description: Collective-Model-As-Judge LLM Benchmark

AutoBench 1.0 Demo

This Space runs a Collective-Model-As-Judge LLM benchmark that compares language models using Hugging Face's Inference API. It is a simplified version of AutoBench 1.0; the full version relies on multiple inference providers (Anthropic, Grok, Nebius, OpenAI, Together AI, Vertex AI) to manage request load and cover a wider range of models. For more advanced use, please refer to the Hugging Face AutoBench 1.0 Repository.

Features

Benchmark multiple models side by side (models evaluate models)
Test models across various topics and difficulty levels
Evaluate question quality and answer quality
Generate detailed performance reports

How to Use

Enter your Hugging Face API token (needed to access models)
Select the models you want to benchmark
Choose topics and number of iterations
Click "Start Benchmark"
View and download results when complete

How it works

On each iteration, the system:

1. generates a question prompt based on a random topic and difficulty level
2. randomly selects a model to generate the question
3. asks all models to rank the question; the question is accepted if its average rank is above a threshold (3.5) and every individual rank is above a set value (2), otherwise step 3 is repeated
4. asks all models to generate an answer
5. for each answer, asks all models to rank it (from 1 to 5) and computes a weighted average rank, with weights proportional to each model's rank
6. computes a cumulative average rank for each model over all iterations
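
The scoring logic in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in app.py: the function names are hypothetical, the question check assumes a plain mean and strict "above" comparisons, and the judge weights are passed in as already-computed numbers.

```python
from statistics import mean

QUESTION_AVG_THRESHOLD = 3.5  # acceptance threshold quoted in the list above
QUESTION_MIN_RANK = 2         # minimum individual rank quoted in the list above


def question_accepted(ranks):
    """Accept a question when its mean rank and its lowest rank both clear their limits.

    Assumption: a plain mean and strict comparisons; the real app may differ.
    """
    return mean(ranks) > QUESTION_AVG_THRESHOLD and min(ranks) > QUESTION_MIN_RANK


def weighted_average_rank(ranks_by_judge, judge_weights):
    """Weighted mean of the 1-5 ranks that the judging models gave to one answer.

    ranks_by_judge: {judge_name: rank given to the answer}
    judge_weights:  {judge_name: weight, assumed proportional to that judge's own rank}
    """
    total_weight = sum(judge_weights[judge] for judge in ranks_by_judge)
    if total_weight == 0:
        raise ValueError("judge weights must not sum to zero")
    weighted_sum = sum(rank * judge_weights[judge] for judge, rank in ranks_by_judge.items())
    return weighted_sum / total_weight


def update_cumulative_average(previous_avg, iterations_so_far, new_rank):
    """Running mean of one model's rank after adding the latest iteration."""
    return (previous_avg * iterations_so_far + new_rank) / (iterations_so_far + 1)
```

For example, `question_accepted([5, 5, 2])` is rejected under these assumptions because one judge gave a rank that is not above 2, even though the mean is 4.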

Models

The benchmark supports any model available through Hugging Face's Inference API, including:

Meta Llama models
Mistral models
Alibaba Qwen models
And many more!

Note

To follow question generation, question ranking, answer generation, and answer ranking in real time, check the container logs (above, to the right of the "running" button).
Running a full benchmark can take some time, depending on the number of models and iterations. Make sure you have sufficient Hugging Face credits to run the benchmark, especially when using many models over many iterations.
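
To get a rough sense of how cost grows with the number of models, the request count can be estimated from the steps listed under "How it works". The sketch below is a back-of-the-envelope estimate only. It assumes one API call per model per step, that every model ranks every answer (including its own), and that each rejected question costs one more generation plus one more round of rankings. The function names are hypothetical and are not part of the app.

```python
def estimate_calls_per_iteration(num_models, question_attempts=1):
    """Rough number of inference calls for one benchmark iteration."""
    # Each question attempt: 1 generation call + one ranking call per model.
    question_calls = question_attempts * (1 + num_models)
    # Every model answers the accepted question once.
    answer_calls = num_models
    # Every model ranks every answer.
    answer_ranking_calls = num_models * num_models
    return question_calls + answer_calls + answer_ranking_calls


def estimate_total_calls(num_models, iterations, question_attempts=1):
    """Rough number of inference calls for a whole run."""
    return iterations * estimate_calls_per_iteration(num_models, question_attempts)


if __name__ == "__main__":
    # 5 models, 10 iterations, every question accepted on the first attempt.
    print(estimate_total_calls(5, 10))
```

Because answer ranking grows with the square of the number of models, doubling the model count roughly quadruples that part of the cost.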

Get Involved!

AutoBench is a step towards more robust, scalable, and future-proof LLM evaluation. We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation!

Start from our blog post on Hugging Face: Escape the Benchmark Trap: AutoBench – the Collective-LLM-as-a-Judge System for Evaluating AI models (ASI-Ready!)
Explore the code and data: Hugging Face AutoBench 1.0 Repository
Try our Demo on Spaces: AutoBench 1.0 Demo
Read the detailed methodology: Detailed Methodology Document
Join the discussion: Hugging Face AutoBench Community Discussion
Contribute: Help us by suggesting new topics, refining prompts, or enhancing the weighting algorithm. Submit pull requests or issues via the Hugging Face Repo.

License

MIT License

Contact

Peter Kruger/eZecute