AutoBench
Organization Details
- Organization: AutoBench
- Point of Contact: Peter Kruger, CEO of eZecute
- Website: www.autobench.org
- Funding: Self-funded, with inference API providers contributing free inference compute.
Organization Description
AutoBench is an organization dedicated to advancing the evaluation of Large Language Models (LLMs) through automated benchmarking. Our flagship project, the AutoBench 1.0 benchmark, uses a "Collective-LLM-as-a-Judge" approach: a group of LLMs assesses the quality of both the questions and the answers generated by other LLMs. AutoBench aims to address the limitations of traditional, static benchmarks by providing a dynamic, scalable, cost-effective, and less human-biased evaluation framework.
Benchmarking System: AutoBench 1.0
Overview
AutoBench 1.0 is a fully automated, iterative benchmark system for evaluating LLMs. It dynamically generates questions, assesses their quality, and ranks LLM-generated answers using a collective of LLMs as judges (a minimal sketch of one iteration follows the list below). This system is designed to be:
- Dynamic: Questions are generated on-the-fly for each iteration, reducing the risk of benchmark gaming.
- Scalable: The system is designed to handle a large number of models and can be easily scaled up.
- Cost-Effective: AutoBench 1.0 achieves high correlation with established benchmarks at a significantly lower cost than human-based evaluations (under $100 for a full run with 20 models).
- Less Human-Biased: While model bias exists, the "Collective-LLM-as-a-Judge" approach reduces reliance on subjective human judgment.
- Granular: Provides topic-specific performance insights, not just an aggregate score.
- Adaptive: Judging-model weights are adjusted over successive runs, so the benchmark improves in quality the more it is used.
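To make the loop concrete, here is a minimal sketch of one such iteration in Python. It is illustrative only: `call_model` is a hypothetical placeholder for an inference API call, and the judge list, candidate list, prompts, 1-5 scoring scale, and acceptance threshold are assumptions rather than the exact AutoBench 1.0 settings.

```python
# Minimal, illustrative sketch of one AutoBench-style iteration.
import statistics

JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # assumed judge collective
CANDIDATES = ["model-x", "model-y", "model-z"]                # assumed models under test


def call_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a real inference API call."""
    raise NotImplementedError


def collective_score(prompt: str) -> float:
    """Average the numeric scores returned by the judge collective."""
    return statistics.mean(float(call_model(judge, prompt)) for judge in JUDGES)


def run_iteration(topic: str) -> dict[str, float]:
    # 1. Dynamically generate a fresh question for this topic.
    question = call_model(JUDGES[0], f"Write one challenging question about {topic}.")

    # 2. Let the judge collective rate the question; regenerate if it scores too low.
    while collective_score(f"Rate this question from 1 to 5:\n{question}") < 3.0:
        question = call_model(JUDGES[0], f"Write one challenging question about {topic}.")

    # 3. Collect each candidate's answer and have the judges score it.
    scores = {}
    for model in CANDIDATES:
        answer = call_model(model, question)
        scores[model] = collective_score(
            f"Rate this answer from 1 to 5.\nQuestion: {question}\nAnswer: {answer}"
        )
    return scores  # per-iteration scores, later aggregated into overall and topic ranks
```

In the full system, these per-iteration scores are accumulated across many iterations and topics to produce the overall and topic-specific ranks described below.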
Key Features
- Collective-LLM-as-a-Judge: Employs a group of LLMs to evaluate both the quality of generated questions and the answers provided by other LLMs.
- Dynamic Question Generation: Generates new questions in each iteration, covering a range of topics and difficulty levels.
- Iterative Evaluation: Runs for a predefined number of iterations to provide robust and statistically meaningful results.
- Model Weighting and Adaptation: Dynamically adjusts the influence of individual judging models based on their performance (see the weight-update sketch after this list).
- Comprehensive Metrics: Provides overall average rank, topic-specific ranks, and correlations with established benchmarks (Chatbot Arena, MMLU, AAQI).
- Error Handling and Robustness: Includes mechanisms for handling API errors, unresponsive models, and invalid responses.
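The weighting-and-adaptation feature can be illustrated with a short sketch. The deviation-from-consensus update rule, the learning rate, and the judge names below are assumptions chosen for clarity, not the exact AutoBench 1.0 scheme.

```python
# Minimal sketch of judge-weight adaptation (assumed update rule, for illustration only).
def update_judge_weights(weights: dict[str, float],
                         judge_scores: dict[str, float],
                         lr: float = 0.1) -> dict[str, float]:
    """Down-weight judges whose scores drift from the weighted consensus."""
    consensus = sum(weights[j] * s for j, s in judge_scores.items()) / sum(weights.values())

    # Penalize each judge in proportion to its distance from the consensus.
    adjusted = {
        j: max(1e-3, w * (1.0 - lr * abs(judge_scores[j] - consensus)))
        for j, w in weights.items()
    }

    # Renormalize so the weights always sum to 1.
    total = sum(adjusted.values())
    return {j: w / total for j, w in adjusted.items()}


# Example: judge-b disagrees with the collective and gradually loses influence.
weights = {"judge-a": 1 / 3, "judge-b": 1 / 3, "judge-c": 1 / 3}
weights = update_judge_weights(weights, {"judge-a": 4.0, "judge-b": 1.0, "judge-c": 4.5})
```

The intended effect is that judges whose scores consistently track the collective consensus gain influence over successive iterations, which is what allows the benchmark to improve with use.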
Intended Use
The AutoBench 1.0 benchmark is intended for:
- Researchers and developers working on LLMs.
- Organizations evaluating LLMs for deployment.
- Anyone interested in tracking the progress of LLM capabilities.
The benchmark provides a standardized, automated, and cost-effective way to assess the performance of LLMs across a variety of tasks and topics.
Ethical Considerations
AutoBench is committed to the responsible development and use of LLMs. We encourage users of the benchmark to consider the potential ethical implications of their work and to use the benchmark results responsibly. The limitations and biases of AutoBench 1.0 should be carefully considered when interpreting the results.
Inference Cost Support
Running a compute-intensive benchmark like AutoBench can be expensive. We welcome support from all inference API providers in the form of free inference credits.
Citation
If you use AutoBench 1.0 in your research, please cite:
@misc{autobench2024,
  title        = {AutoBench 1.0: A Collective-LLM-as-a-Judge Benchmark System},
  author       = {AutoBench},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co./AutoBench}},
  note         = {Accessed: [Date Accessed]}
}
Learn More and Contribute