from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">MageBench Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
MageBench is a reasoning-oriented multimodal intelligent agent benchmark introduced in the paper ["MageBench: Bridging Large Multimodal Models to Agents"](https://arxiv.org/abs/2412.04531).
The tasks we selected meet the following criteria:
- The environment is simple.
- The task reflects a certain reasoning ability.
- The task has a high level of visual involvement.
In our paper, we demonstrate that our benchmark can generalize well to other scenarios.
We hope our work can empower future research in the fields of intelligent agents, robotics, and more.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works
This platform does not run your model; it only hosts the leaderboard.
Choose the preset that matches your setup, evaluate your model in your local environment,
and then submit the results to us for approval. Once approved, your results are made public.
## Reproducibility
Because we cannot reproduce submitters' results ourselves, we rely on submitters to make their results verifiable.
We require all submitters to provide either a link to a paper, blog post, or report that includes contact information, or a link to an open-source GitHub repository that reproduces the results.
**Results that do not meet these conditions, or that have other issues affecting fairness
(such as an incorrect setting category), will be removed.**
"""
EVALUATION_QUEUE_TEXT = """
# Instructions to submit results
- First, make sure you have read the About section.
- Test your model locally and submit your results using the form below.
- Upload **one** result at a time by filling in the form and clicking "Upload One Eval"; the result will then appear in the "Uploaded results" section.
- Keep uploading until all results are uploaded, then click "Submit All". After the Space restarts, your results will appear on the leaderboard, marked as checking.
- If your uploaded results contain an error, click "Click Upload" and re-upload all results.
- If there is an error in submitted results, you can upload a replacement; we will use the most recently submitted results during our review.
- If there is an error in "checked" results, email us to withdraw them.
# Detailed settings
- **Score**: float, the evaluation score for this result.
- **Name**: str of **fewer than 3 words**, an abbreviation representing your work; it can be a model name or key words from your paper.
- **BaseModel**: str, the LMM used by the agent; preferably its unique Hugging Face model id.
- **Target-research**: (1) `Model-Eval-Online` and `Model-Eval-Global` are the standard settings proposed in our paper and are used to test model capability. (2) `Agent-Eval-Prompt`: any agent design that uses fixed model weights, including RAG, memory, etc. (3) `Agent-Eval-Finetune`: the model weights are changed by training on in-domain (same environment) data.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{zhang2024magebench,
  title={MageBench: Bridging Large Multimodal Models to Agents},
  author={Miaosen Zhang and Qi Dai and Yifan Yang and Jianmin Bao and Dongdong Chen and Kai Qiu and Chong Luo and Xin Geng and Baining Guo},
  journal={arXiv preprint arXiv:2412.04531},
  year={2024}
}
"""