Upload 3 files
- about.md +1 -14
- app.py +5 -409
- benchmark_submission.md +1 -1
about.md
CHANGED
@@ -47,17 +47,4 @@ We see HAL being useful for four categories of users:
 1. Downstream users and procurers of agents: Customers looking to deploy agents can get visibility into existing benchmarks that resemble tasks of interest to them, get to know who are the developers building useful agents (and see agent demos), and identify where the state of the art is for both cost and accuracy for the tasks they are looking to solve.
 2. Agent benchmark developers: Reporting results on a centralized leaderboard could allow improved visibility into agent benchmarks that measure real-world utility.
 3. Agent developers: HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.
-4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).
-
-## Platform demo
-
-The platform features a user-friendly frontend for accessing and interacting with the evaluation results generated with the evaluation harness.
-
-- Public leaderboards: The public leaderboard displays agent rankings for each supported benchmark. While some metrics are benchmark-specific, others, for example, cost, are reported for each.
-- Automatic Pareto frontiers: We automatically determine the convex hull of agents for each benchmark and visualize it in a scatter plot.
-- Verified Results: We will prominently show which results we verified (by re-running them). We plan to periodically re-run the top 5 agents for each benchmark in order to always have a verified SOTA agent.
-- Task completion heatmap: Of all tasks in the benchmark, how many were solved by which agent? How many were solved by at least one?
-- Qualitative Failure Mode Analysis: In addition to the raw predictions, we also provide an LLM-based analysis of the recurring failure modes of each agent and plot the number of affected tasks for each identified failure mode.
-- Agent Monitor: To aid visibility into which steps an agent took and compare the approaches of agents on the same task, we provide a visual overview of the steps an agent took. For each step, we include an LLM-generated summary of the action in light of the overall goal of a task.
-- Raw Traces: All uploaded results are publicly accessible, and detailed information about each evaluation, including API parameters, token usage, and I/O, is made available.
-- Submission Interface: We provide a submission interface for researchers to upload their evaluation results in a standardized way across all benchmarks supported by the platform to support easy integration of benchmark results for any given agent.
+4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).
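Aside: the removed "Automatic Pareto frontiers" bullet describes picking out, for each benchmark, the agents that are not dominated on cost and accuracy. Below is a minimal sketch of that idea; it is illustrative only (not HAL's actual `create_scatter_plot` or leaderboard code), the column names are assumptions, and the leaderboard text additionally restricts the frontier to its convex hull.

```python
import pandas as pd

def pareto_frontier(df: pd.DataFrame,
                    cost_col: str = "Total Cost",
                    acc_col: str = "Accuracy") -> pd.DataFrame:
    """Keep agents for which no other agent is both cheaper and more accurate."""
    ordered = df.sort_values([cost_col, acc_col], ascending=[True, False])
    rows, best_acc = [], float("-inf")
    for _, row in ordered.iterrows():
        if row[acc_col] > best_acc:  # strictly better than every cheaper agent
            rows.append(row)
            best_acc = row[acc_col]
    return pd.DataFrame(rows)

# Toy data: agent C is dominated by B (more expensive, less accurate).
agents = pd.DataFrame({
    "Agent Name": ["A", "B", "C"],
    "Total Cost": [1.0, 2.5, 4.0],
    "Accuracy":   [0.30, 0.45, 0.42],
})
print(pareto_frontier(agents)[["Agent Name", "Total Cost", "Accuracy"]])
```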
app.py
CHANGED
@@ -391,6 +391,7 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
     .user-type-links a {
         display: inline-block;
         padding: 5px 12px;
+        margin-bottom: 5px;
         background-color: #f0f4f8;
         color: #2c3e50 !important; /* Force the color change */
         text-decoration: none !important; /* Force remove underline */
@@ -1124,422 +1125,17 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
                                  outputs=[raw_step_dropdown, raw_call_details])
         raw_step_dropdown.change(update_raw_call_details,
                                  inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-                                 outputs=[raw_call_details])
-
-
-        # with gr.Tab("SWE-Bench Verified"):
-        #     gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. The We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
-        #     with gr.Row():
-        #         with gr.Column(scale=2):
-        #             Leaderboard(
-        #                 value=parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified'),
-        #                 select_columns=SelectColumns(
-        #                     default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
-        #                     cant_deselect=["Agent Name"],
-        #                     label="Select Columns to Display:",
-        #                 ),
-        #                 hide_columns=config.SWEBENCH_HIDE_COLUMNS,
-        #                 search_columns=config.SWEBENCH_SEARCH_COLUMNS
-        #             )
-        #     gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
-        #     with gr.Row():
-        #         scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Task success heatmap")
-        #     with gr.Row():
-        #         task_success_heatmap = gr.Plot()
-        #     demo.load(
-        #         lambda: create_task_success_heatmap(
-        #             preprocessor.get_task_success_data('swebench_verified'),
-        #             'SWEBench Verified'
-        #         ),
-        #         outputs=[task_success_heatmap]
-        #     )
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Failure report for each agent")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_categories_overview = gr.Markdown()
-
-        #         with gr.Column(scale=1):
-        #             failure_categories_chart = gr.Plot()
-
-        #     # Initialize the failure report agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
-        #               outputs=[failure_report_agent_dropdown])
-
-        #     # Update failure report when agent is selected
-        #     failure_report_agent_dropdown.change(update_failure_report,
-        #                                          inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
-        #                                          outputs=[failure_categories_overview, failure_categories_chart])
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Agent monitor")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
-        #     with gr.Row():
-        #         task_overview = gr.Markdown()
-        #     with gr.Row():
-        #         flow_chart = gr.Plot(label="Task Flow")
-
-        #     # Initialize the agent dropdown with the best agent
-        #     demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
-        #     demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-
-        #     agent_dropdown.change(update_task_analysis,
-        #                           inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
-        #                           outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-        #     task_dropdown.change(update_task_details,
-        #                          inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
-        #                          outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
-
-        #     gr.Markdown("## Raw predictions")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             raw_agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             raw_task_dropdown = gr.Dropdown(label="Select Task")
-        #         with gr.Column(scale=1):
-        #             raw_step_dropdown = gr.Dropdown(label="Select Step")
-
-        #     with gr.Row():
-        #         raw_call_details = gr.HTML()
-
-        #     def update_raw_task_dropdown(agent_name):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
-        #         if not analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
-        #         task_ids = list(analyzed_traces.keys())
-        #         steps = analyzed_traces[task_ids[0]]['steps']
-        #         return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
-
-        #     def update_raw_step_dropdown(agent_name, task_id):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
-        #         steps = analyzed_traces[task_id]['steps']
-        #         return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
-
-        #     def update_raw_call_details(agent_name, task_id, step_index):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return "No data available for this selection."
-        #         steps = analyzed_traces[task_id]['steps']
-        #         if step_index is None:
-        #             return "Invalid step selection."
-        #         step = steps[step_index]
-        #         return format_call_info(step, step_index)
-
-        #     # Initialize the raw agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
-        #               outputs=[raw_agent_dropdown])
-        #     demo.load(update_raw_task_dropdown,
-        #               inputs=[raw_agent_dropdown],
-        #               outputs=[raw_task_dropdown, raw_step_dropdown])
-        #     demo.load(update_raw_call_details,
-        #               inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #               outputs=[raw_call_details])
-
-        #     raw_agent_dropdown.change(update_raw_task_dropdown,
-        #                               inputs=[raw_agent_dropdown],
-        #                               outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
-        #     raw_task_dropdown.change(update_raw_step_dropdown,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown],
-        #                              outputs=[raw_step_dropdown, raw_call_details])
-        #     raw_step_dropdown.change(update_raw_call_details,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #                              outputs=[raw_call_details])
-
-        # with gr.Tab("SWE-Bench Lite"):
-        #     gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Lite is a subset of 300 tasks of the original SWE-bench. We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
-        #     with gr.Row():
-        #         with gr.Column(scale=2):
-        #             Leaderboard(
-        #                 value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite'), ci_metrics=['Accuracy', 'Total Cost']),
-        #                 select_columns=SelectColumns(
-        #                     default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
-        #                     cant_deselect=["Agent Name"],
-        #                     label="Select Columns to Display:",
-        #                 ),
-        #                 search_columns=config.SWEBENCH_SEARCH_COLUMNS,
-        #                 hide_columns=config.SWEBENCH_HIDE_COLUMNS
-        #             )
-        #     # make right aligned markdown
-        #     gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
-        #     with gr.Row():
-        #         scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite', aggregate=True), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Task success heatmap")
-        #     with gr.Row():
-        #         task_success_heatmap = gr.Plot()
-        #     demo.load(
-        #         lambda: create_task_success_heatmap(
-        #             preprocessor.get_task_success_data('swebench_lite'),
-        #             'SWEBench Lite'
-        #         ),
-        #         outputs=[task_success_heatmap]
-        #     )
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Failure report for each agent")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_categories_overview = gr.Markdown()
-
-        #         with gr.Column(scale=1):
-        #             failure_categories_chart = gr.Plot()
-
-        #     # Initialize the failure report agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
-        #               outputs=[failure_report_agent_dropdown])
-
-        #     # Update failure report when agent is selected
-        #     failure_report_agent_dropdown.change(update_failure_report,
-        #                                          inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_lite", visible=False)],
-        #                                          outputs=[failure_categories_overview, failure_categories_chart])
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Agent monitor")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
-        #     with gr.Row():
-        #         task_overview = gr.Markdown()
-        #     with gr.Row():
-        #         flow_chart = gr.Plot(label="Task Flow")
-
-        #     # Initialize the agent dropdown with the best agent
-        #     demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
-        #     demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-
-        #     agent_dropdown.change(update_task_analysis,
-        #                           inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown],
-        #                           outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-        #     task_dropdown.change(update_task_details,
-        #                          inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown, task_dropdown],
-        #                          outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
-
-
-        #     gr.Markdown("## Raw predictions")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             raw_agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             raw_task_dropdown = gr.Dropdown(label="Select Task")
-        #         with gr.Column(scale=1):
-        #             raw_step_dropdown = gr.Dropdown(label="Select Step")
-
-        #     with gr.Row():
-        #         raw_call_details = gr.HTML()
-
-        #     def update_raw_task_dropdown(agent_name):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
-        #         if not analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
-        #         task_ids = list(analyzed_traces.keys())
-        #         steps = analyzed_traces[task_ids[0]]['steps']
-        #         return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
-
-        #     def update_raw_step_dropdown(agent_name, task_id):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
-        #         steps = analyzed_traces[task_id]['steps']
-        #         return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
-
-        #     def update_raw_call_details(agent_name, task_id, step_index):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return "No data available for this selection."
-        #         steps = analyzed_traces[task_id]['steps']
-        #         if step_index is None:
-        #             return "Invalid step selection."
-        #         step = steps[step_index]
-        #         return format_call_info(step, step_index)
-
-        #     # Initialize the raw agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
-        #               outputs=[raw_agent_dropdown])
-        #     demo.load(update_raw_task_dropdown,
-        #               inputs=[raw_agent_dropdown],
-        #               outputs=[raw_task_dropdown, raw_step_dropdown])
-        #     demo.load(update_raw_call_details,
-        #               inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #               outputs=[raw_call_details])
-
-        #     raw_agent_dropdown.change(update_raw_task_dropdown,
-        #                               inputs=[raw_agent_dropdown],
-        #                               outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
-        #     raw_task_dropdown.change(update_raw_step_dropdown,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown],
-        #                              outputs=[raw_step_dropdown, raw_call_details])
-        #     raw_step_dropdown.change(update_raw_call_details,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #                              outputs=[raw_call_details])
-
-
-
-        # with gr.Tab("MLAgentBench"):
-        #     gr.Markdown("""MLAgentBench is a suite of end-to-end Machine Learning (ML) experimentation tasks, where the agent aims to take a given dataset and a machine learning task description and autonomously develop or improve an ML model. We are currently actively developing this platform and this benchmark is not fully implemented yet. In particular, we only include one agent and a subset of tasks for this benchmark.""")
-        #     with gr.Row():
-        #         with gr.Column(scale=2):
-        #             Leaderboard(
-        #                 value=parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench'),
-        #                 select_columns=SelectColumns(
-        #                     default_selection=config.MLAGENTBENCH_ON_LOAD_COLUMNS + ["Verified"],
-        #                     cant_deselect=["Agent Name"],
-        #                     label="Select Columns to Display:",
-        #                 ),
-        #                 search_columns=config.MLAGENTBENCH_SEARCH_COLUMNS,
-        #                 hide_columns=config.MLAGENTBENCH_HIDE_COLUMNS,
-        #             )
-        #     gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
-        #     with gr.Row():
-        #         scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench', aggregate=False), "Total Cost", "Overall Score", "Total Cost (in USD)", "Overall Score", ["Agent Name"]))
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Failure report for each agent")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             failure_categories_overview = gr.Markdown()
-
-        #         with gr.Column(scale=1):
-        #             failure_categories_chart = gr.Plot()
-
-        #     # Initialize the failure report agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
-        #               outputs=[failure_report_agent_dropdown])
-
-        #     # Update failure report when agent is selected
-        #     failure_report_agent_dropdown.change(update_failure_report,
-        #                                          inputs=[failure_report_agent_dropdown, gr.Textbox(value="mlagentbench", visible=False)],
-        #                                          outputs=[failure_categories_overview, failure_categories_chart])
-
-        #     gr.Markdown("")
-        #     gr.Markdown("")
-        #     gr.Markdown("## Agent monitor")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
-        #     with gr.Row():
-        #         task_overview = gr.Markdown()
-        #     with gr.Row():
-        #         flow_chart = gr.Plot(label="Task Flow")
-
-        #     # Initialize the agent dropdown with the best agent
-        #     demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)], outputs=[agent_dropdown])
-        #     demo.load(update_task_analysis, inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-
-        #     agent_dropdown.change(update_task_analysis,
-        #                           inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown],
-        #                           outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
-        #     task_dropdown.change(update_task_details,
-        #                          inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown, task_dropdown],
-        #                          outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
-
-
-        #     gr.Markdown("## Raw predictions")
-        #     with gr.Row():
-        #         with gr.Column(scale=1):
-        #             raw_agent_dropdown = gr.Dropdown(label="Select Agent")
-        #         with gr.Column(scale=1):
-        #             raw_task_dropdown = gr.Dropdown(label="Select Task")
-        #         with gr.Column(scale=1):
-        #             raw_step_dropdown = gr.Dropdown(label="Select Step")
-
-        #     with gr.Row():
-        #         raw_call_details = gr.HTML()
-
-        #     def update_raw_task_dropdown(agent_name):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
-        #         if not analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
-        #         task_ids = list(analyzed_traces.keys())
-        #         steps = analyzed_traces[task_ids[0]]['steps']
-        #         return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
-
-        #     def update_raw_step_dropdown(agent_name, task_id):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
-        #         steps = analyzed_traces[task_id]['steps']
-        #         return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
-
-        #     def update_raw_call_details(agent_name, task_id, step_index):
-        #         analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
-        #         if not analyzed_traces or task_id not in analyzed_traces:
-        #             return "No data available for this selection."
-        #         steps = analyzed_traces[task_id]['steps']
-        #         if step_index is None:
-        #             return "Invalid step selection."
-        #         step = steps[step_index]
-        #         return format_call_info(step, step_index)
-
-        #     # Initialize the raw agent dropdown with all agents
-        #     demo.load(update_agent_dropdown,
-        #               inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
-        #               outputs=[raw_agent_dropdown])
-        #     demo.load(update_raw_task_dropdown,
-        #               inputs=[raw_agent_dropdown],
-        #               outputs=[raw_task_dropdown, raw_step_dropdown])
-        #     demo.load(update_raw_call_details,
-        #               inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #               outputs=[raw_call_details])
-
-        #     raw_agent_dropdown.change(update_raw_task_dropdown,
-        #                               inputs=[raw_agent_dropdown],
-        #                               outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
-        #     raw_task_dropdown.change(update_raw_step_dropdown,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown],
-        #                              outputs=[raw_step_dropdown, raw_call_details])
-        #     raw_step_dropdown.change(update_raw_call_details,
-        #                              inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
-        #                              outputs=[raw_call_details])
-
+                                 outputs=[raw_call_details])
 
         with gr.Tab("About"):
             gr.Markdown((Path(__file__).parent / "about.md").read_text())


-            gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>
-                    <p>Below we provide a guide on how to add an agent to the leaderboard:</p>""")
+            gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>""")
             gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
-            gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>
-                    <p>Below we provide a guide on how to add a benchmark to the leaderboard:</p>""")
+            gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>""")
             gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
-            gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>
-                    <p>Below we provide a guide on how to reproduce evaluations:</p>""")
+            gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
             gr.Markdown("""Coming soon...""")

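The handful of uncommented lines kept at the top of the second hunk (`demo.load(...)`, `*_dropdown.change(...)`) are standard Gradio event wiring. A minimal self-contained sketch of that pattern follows; the component names and the `update_details` function are illustrative, not HAL's actual helpers.

```python
import gradio as gr

def update_details(agent: str, task: str) -> str:
    # In HAL this would look up the stored trace for (agent, task); here we just echo.
    return f"<b>{agent}</b> on <i>{task}</i>"

with gr.Blocks() as demo:
    agent_dropdown = gr.Dropdown(["agent-a", "agent-b"], label="Select Agent", value="agent-a")
    task_dropdown = gr.Dropdown(["task-1", "task-2"], label="Select Task", value="task-1")
    details = gr.HTML()

    # Fill the panel once on page load, then refresh it whenever either dropdown changes.
    demo.load(update_details, inputs=[agent_dropdown, task_dropdown], outputs=[details])
    agent_dropdown.change(update_details, inputs=[agent_dropdown, task_dropdown], outputs=[details])
    task_dropdown.change(update_details, inputs=[agent_dropdown, task_dropdown], outputs=[details])

if __name__ == "__main__":
    demo.launch()
```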
benchmark_submission.md
CHANGED
@@ -1,4 +1,4 @@
-To submit **a new benchmark
+To submit **a new benchmark** to the library:
 
 1. Implement a new benchmark using some standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each tasks as well as the task environment that is provided inside the container the agent is run in.
 
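Step 1 of the submission guide asks for the benchmark to be expressed in a standard format such as the METR Task Standard. Below is a rough, hedged sketch of what that can look like; the `TaskFamily` method names and signatures are recalled from the standard and should be checked against its documentation, and the task itself is made up.

```python
# task.py — illustrative sketch only, loosely following the METR Task Standard.

class TaskFamily:
    standard_version = "0.1.0"  # assumed; pin to the version of the standard you target

    @staticmethod
    def get_tasks() -> dict:
        # One entry per task; the value carries whatever the task environment needs.
        return {"reverse_string": {"input": "hello", "expected": "olleh"}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The exact instructions shown to the agent inside its container.
        return f"Reverse the string '{t['input']}' and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return float(submission.strip() == t["expected"])
```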