To submit **a new agent** for evaluation, developers should only need to:
1. Adhere to the standardized I/O format: Ensure the agent run file complies with the benchmark-specific I/O format. Depending on HAL's implementation, this could involve:
    * Providing a specific entry point to the agent (e.g., a Python script or function)
    * Correctly handling instructions and the submission process. For example, in METR's Vivaria, this can mean supplying a *main.py* file as the entry point and managing *instructions.txt* and *submission.txt* files (see the first sketch after this list).
2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters (see the second sketch below).
    * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
    * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
    * However, they are missing some pieces we are interested in, such as the latency and parameters of LLM calls; Weave provides a minimum-effort solution.
3. Use our CLI to run evaluations and upload the results. The same CLI can also be used to rerun existing agent-benchmark pairs from the leaderboard.
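As an illustration of the first point, a minimal Vivaria-style entry point might look like the following. This is only a sketch: the *instructions.txt* and *submission.txt* file names follow the convention mentioned above, while `solve_task` and its internals are hypothetical placeholders for the agent's actual logic.

```python
# main.py -- hypothetical entry point following the Vivaria-style convention
# described above: read the task from instructions.txt, run the agent,
# and write the final answer to submission.txt.
from pathlib import Path


def solve_task(instructions: str) -> str:
    """Placeholder for the agent's actual logic (LLM calls, tool use, etc.)."""
    raise NotImplementedError


def main() -> None:
    instructions = Path("instructions.txt").read_text()
    answer = solve_task(instructions)
    Path("submission.txt").write_text(answer)


if __name__ == "__main__":
    main()
```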
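For the second point, the snippet below sketches how Weave can wrap an agent's LLM calls so that inputs, outputs, latency, and token usage are captured. `weave.init` and the `@weave.op()` decorator are part of Weave's documented API; the project name, model choice, and the `call_model` wrapper are illustrative assumptions rather than anything prescribed by HAL.

```python
# Hypothetical example of wrapping an agent's LLM calls with W&B Weave so that
# each call's inputs, outputs, and timing are logged automatically.
import weave
from openai import OpenAI

# Initialize Weave once per run; "my-agent-eval" is an illustrative project name.
weave.init("my-agent-eval")

client = OpenAI()


@weave.op()  # traces this function: arguments, return value, and timing
def call_model(prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.0) -> str:
    """Single LLM call; Weave's OpenAI integration also records the underlying API call."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(call_model("Summarize the task instructions in one sentence."))
```

After the run, the traced calls (with their parameters and timing) can be inspected in the Weave dashboard and aggregated into the cost and latency figures the leaderboard expects.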