benediktstroebl committed
Commit dabe474 · verified · 1 Parent(s): 6d5d4aa

Upload 3 files

Files changed (3)
  1. about.md +1 -14
  2. app.py +5 -409
  3. benchmark_submission.md +1 -1
about.md CHANGED
@@ -47,17 +47,4 @@ We see HAL being useful for four categories of users:
  1. Downstream users and procurers of agents: Customers looking to deploy agents can get visibility into existing benchmarks that resemble tasks of interest to them, get to know who are the developers building useful agents (and see agent demos), and identify where the state of the art is for both cost and accuracy for the tasks they are looking to solve.
  2. Agent benchmark developers: Reporting results on a centralized leaderboard could allow improved visibility into agent benchmarks that measure real-world utility.
  3. Agent developers: HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.
- 4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).
-
- ## Platform demo
-
- The platform features a user-friendly frontend for accessing and interacting with the evaluation results generated with the evaluation harness.
-
- - Public leaderboards: The public leaderboard displays agent rankings for each supported benchmark. While some metrics are benchmark-specific, others, for example, cost, are reported for each.
- - Automatic Pareto frontiers: We automatically determine the convex hull of agents for each benchmark and visualize it in a scatter plot.
- - Verified Results: We will prominently show which results we verified (by re-running them). We plan to periodically re-run the top 5 agents for each benchmark in order to always have a verified SOTA agent.
- - Task completion heatmap: Of all tasks in the benchmark, how many were solved by which agent? How many were solved by at least one?
- - Qualitative Failure Mode Analysis: In addition to the raw predictions, we also provide an LLM-based analysis of the recurring failure modes of each agent and plot the number of affected tasks for each identified failure mode.
- - Agent Monitor: To aid visibility into which steps an agent took and compare the approaches of agents on the same task, we provide a visual overview of the steps an agent took. For each step, we include an LLM-generated summary of the action in light of the overall goal of a task.
- - Raw Traces: All uploaded results are publicly accessible, and detailed information about each evaluation, including API parameters, token usage, and I/O, is made available.
- - Submission Interface: We provide a submission interface for researchers to upload their evaluation results in a standardized way across all benchmarks supported by the platform to support easy integration of benchmark results for any given agent.
+ 4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).
 
app.py CHANGED
@@ -391,6 +391,7 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
391
  .user-type-links a {
392
  display: inline-block;
393
  padding: 5px 12px;
 
394
  background-color: #f0f4f8;
395
  color: #2c3e50 !important; /* Force the color change */
396
  text-decoration: none !important; /* Force remove underline */
@@ -1124,422 +1125,17 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
1124
  outputs=[raw_step_dropdown, raw_call_details])
1125
  raw_step_dropdown.change(update_raw_call_details,
1126
  inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1127
- outputs=[raw_call_details])
1128
-
1129
-
1130
- # with gr.Tab("SWE-Bench Verified"):
1131
- # gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. The We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
1132
- # with gr.Row():
1133
- # with gr.Column(scale=2):
1134
- # Leaderboard(
1135
- # value=parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified'),
1136
- # select_columns=SelectColumns(
1137
- # default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
1138
- # cant_deselect=["Agent Name"],
1139
- # label="Select Columns to Display:",
1140
- # ),
1141
- # hide_columns=config.SWEBENCH_HIDE_COLUMNS,
1142
- # search_columns=config.SWEBENCH_SEARCH_COLUMNS
1143
- # )
1144
- # gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
1145
- # with gr.Row():
1146
- # scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
1147
-
1148
- # gr.Markdown("")
1149
- # gr.Markdown("")
1150
- # gr.Markdown("## Task success heatmap")
1151
- # with gr.Row():
1152
- # task_success_heatmap = gr.Plot()
1153
- # demo.load(
1154
- # lambda: create_task_success_heatmap(
1155
- # preprocessor.get_task_success_data('swebench_verified'),
1156
- # 'SWEBench Verified'
1157
- # ),
1158
- # outputs=[task_success_heatmap]
1159
- # )
1160
-
1161
- # gr.Markdown("")
1162
- # gr.Markdown("")
1163
- # gr.Markdown("## Failure report for each agent")
1164
- # with gr.Row():
1165
- # with gr.Column(scale=1):
1166
- # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
1167
- # with gr.Row():
1168
- # with gr.Column(scale=1):
1169
- # failure_categories_overview = gr.Markdown()
1170
-
1171
- # with gr.Column(scale=1):
1172
- # failure_categories_chart = gr.Plot()
1173
-
1174
- # # Initialize the failure report agent dropdown with all agents
1175
- # demo.load(update_agent_dropdown,
1176
- # inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1177
- # outputs=[failure_report_agent_dropdown])
1178
-
1179
- # # Update failure report when agent is selected
1180
- # failure_report_agent_dropdown.change(update_failure_report,
1181
- # inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
1182
- # outputs=[failure_categories_overview, failure_categories_chart])
1183
-
1184
- # gr.Markdown("")
1185
- # gr.Markdown("")
1186
- # gr.Markdown("## Agent monitor")
1187
- # with gr.Row():
1188
- # with gr.Column(scale=1):
1189
- # agent_dropdown = gr.Dropdown(label="Select Agent")
1190
- # with gr.Column(scale=1):
1191
- # task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
1192
- # with gr.Row():
1193
- # task_overview = gr.Markdown()
1194
- # with gr.Row():
1195
- # flow_chart = gr.Plot(label="Task Flow")
1196
-
1197
- # # Initialize the agent dropdown with the best agent
1198
- # demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
1199
- # demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1200
-
1201
- # agent_dropdown.change(update_task_analysis,
1202
- # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
1203
- # outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1204
- # task_dropdown.change(update_task_details,
1205
- # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
1206
- # outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1207
-
1208
- # gr.Markdown("## Raw predictions")
1209
- # with gr.Row():
1210
- # with gr.Column(scale=1):
1211
- # raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1212
- # with gr.Column(scale=1):
1213
- # raw_task_dropdown = gr.Dropdown(label="Select Task")
1214
- # with gr.Column(scale=1):
1215
- # raw_step_dropdown = gr.Dropdown(label="Select Step")
1216
-
1217
- # with gr.Row():
1218
- # raw_call_details = gr.HTML()
1219
-
1220
- # def update_raw_task_dropdown(agent_name):
1221
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1222
- # if not analyzed_traces:
1223
- # return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1224
- # task_ids = list(analyzed_traces.keys())
1225
- # steps = analyzed_traces[task_ids[0]]['steps']
1226
- # return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
1227
-
1228
- # def update_raw_step_dropdown(agent_name, task_id):
1229
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1230
- # if not analyzed_traces or task_id not in analyzed_traces:
1231
- # return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
1232
- # steps = analyzed_traces[task_id]['steps']
1233
- # return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1234
-
1235
- # def update_raw_call_details(agent_name, task_id, step_index):
1236
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1237
- # if not analyzed_traces or task_id not in analyzed_traces:
1238
- # return "No data available for this selection."
1239
- # steps = analyzed_traces[task_id]['steps']
1240
- # if step_index is None:
1241
- # return "Invalid step selection."
1242
- # step = steps[step_index]
1243
- # return format_call_info(step, step_index)
1244
-
1245
- # # Initialize the raw agent dropdown with all agents
1246
- # demo.load(update_agent_dropdown,
1247
- # inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1248
- # outputs=[raw_agent_dropdown])
1249
- # demo.load(update_raw_task_dropdown,
1250
- # inputs=[raw_agent_dropdown],
1251
- # outputs=[raw_task_dropdown, raw_step_dropdown])
1252
- # demo.load(update_raw_call_details,
1253
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1254
- # outputs=[raw_call_details])
1255
-
1256
- # raw_agent_dropdown.change(update_raw_task_dropdown,
1257
- # inputs=[raw_agent_dropdown],
1258
- # outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1259
- # raw_task_dropdown.change(update_raw_step_dropdown,
1260
- # inputs=[raw_agent_dropdown, raw_task_dropdown],
1261
- # outputs=[raw_step_dropdown, raw_call_details])
1262
- # raw_step_dropdown.change(update_raw_call_details,
1263
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1264
- # outputs=[raw_call_details])
1265
-
1266
- # with gr.Tab("SWE-Bench Lite"):
1267
- # gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Lite is a subset of 300 tasks of the original SWE-bench. We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
1268
- # with gr.Row():
1269
- # with gr.Column(scale=2):
1270
- # Leaderboard(
1271
- # value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite'), ci_metrics=['Accuracy', 'Total Cost']),
1272
- # select_columns=SelectColumns(
1273
- # default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
1274
- # cant_deselect=["Agent Name"],
1275
- # label="Select Columns to Display:",
1276
- # ),
1277
- # search_columns=config.SWEBENCH_SEARCH_COLUMNS,
1278
- # hide_columns=config.SWEBENCH_HIDE_COLUMNS
1279
- # )
1280
- # # make right aligned markdown
1281
- # gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
1282
- # with gr.Row():
1283
- # scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite', aggregate=True), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
1284
-
1285
- # gr.Markdown("")
1286
- # gr.Markdown("")
1287
- # gr.Markdown("## Task success heatmap")
1288
- # with gr.Row():
1289
- # task_success_heatmap = gr.Plot()
1290
- # demo.load(
1291
- # lambda: create_task_success_heatmap(
1292
- # preprocessor.get_task_success_data('swebench_lite'),
1293
- # 'SWEBench Lite'
1294
- # ),
1295
- # outputs=[task_success_heatmap]
1296
- # )
1297
-
1298
- # gr.Markdown("")
1299
- # gr.Markdown("")
1300
- # gr.Markdown("## Failure report for each agent")
1301
- # with gr.Row():
1302
- # with gr.Column(scale=1):
1303
- # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
1304
- # with gr.Row():
1305
- # with gr.Column(scale=1):
1306
- # failure_categories_overview = gr.Markdown()
1307
-
1308
- # with gr.Column(scale=1):
1309
- # failure_categories_chart = gr.Plot()
1310
-
1311
- # # Initialize the failure report agent dropdown with all agents
1312
- # demo.load(update_agent_dropdown,
1313
- # inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1314
- # outputs=[failure_report_agent_dropdown])
1315
-
1316
- # # Update failure report when agent is selected
1317
- # failure_report_agent_dropdown.change(update_failure_report,
1318
- # inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_lite", visible=False)],
1319
- # outputs=[failure_categories_overview, failure_categories_chart])
1320
-
1321
- # gr.Markdown("")
1322
- # gr.Markdown("")
1323
- # gr.Markdown("## Agent monitor")
1324
- # with gr.Row():
1325
- # with gr.Column(scale=1):
1326
- # agent_dropdown = gr.Dropdown(label="Select Agent")
1327
- # with gr.Column(scale=1):
1328
- # task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
1329
- # with gr.Row():
1330
- # task_overview = gr.Markdown()
1331
- # with gr.Row():
1332
- # flow_chart = gr.Plot(label="Task Flow")
1333
-
1334
- # # Initialize the agent dropdown with the best agent
1335
- # demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
1336
- # demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1337
-
1338
- # agent_dropdown.change(update_task_analysis,
1339
- # inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown],
1340
- # outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1341
- # task_dropdown.change(update_task_details,
1342
- # inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown, task_dropdown],
1343
- # outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1344
-
1345
-
1346
- # gr.Markdown("## Raw predictions")
1347
- # with gr.Row():
1348
- # with gr.Column(scale=1):
1349
- # raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1350
- # with gr.Column(scale=1):
1351
- # raw_task_dropdown = gr.Dropdown(label="Select Task")
1352
- # with gr.Column(scale=1):
1353
- # raw_step_dropdown = gr.Dropdown(label="Select Step")
1354
-
1355
- # with gr.Row():
1356
- # raw_call_details = gr.HTML()
1357
-
1358
- # def update_raw_task_dropdown(agent_name):
1359
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1360
- # if not analyzed_traces:
1361
- # return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1362
- # task_ids = list(analyzed_traces.keys())
1363
- # steps = analyzed_traces[task_ids[0]]['steps']
1364
- # return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
1365
-
1366
- # def update_raw_step_dropdown(agent_name, task_id):
1367
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1368
- # if not analyzed_traces or task_id not in analyzed_traces:
1369
- # return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
1370
- # steps = analyzed_traces[task_id]['steps']
1371
- # return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1372
-
1373
- # def update_raw_call_details(agent_name, task_id, step_index):
1374
- # analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1375
- # if not analyzed_traces or task_id not in analyzed_traces:
1376
- # return "No data available for this selection."
1377
- # steps = analyzed_traces[task_id]['steps']
1378
- # if step_index is None:
1379
- # return "Invalid step selection."
1380
- # step = steps[step_index]
1381
- # return format_call_info(step, step_index)
1382
-
1383
- # # Initialize the raw agent dropdown with all agents
1384
- # demo.load(update_agent_dropdown,
1385
- # inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1386
- # outputs=[raw_agent_dropdown])
1387
- # demo.load(update_raw_task_dropdown,
1388
- # inputs=[raw_agent_dropdown],
1389
- # outputs=[raw_task_dropdown, raw_step_dropdown])
1390
- # demo.load(update_raw_call_details,
1391
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1392
- # outputs=[raw_call_details])
1393
-
1394
- # raw_agent_dropdown.change(update_raw_task_dropdown,
1395
- # inputs=[raw_agent_dropdown],
1396
- # outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1397
- # raw_task_dropdown.change(update_raw_step_dropdown,
1398
- # inputs=[raw_agent_dropdown, raw_task_dropdown],
1399
- # outputs=[raw_step_dropdown, raw_call_details])
1400
- # raw_step_dropdown.change(update_raw_call_details,
1401
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1402
- # outputs=[raw_call_details])
1403
-
1404
-
1405
-
1406
- # with gr.Tab("MLAgentBench"):
1407
- # gr.Markdown("""MLAgentBench is a suite of end-to-end Machine Learning (ML) experimentation tasks, where the agent aims to take a given dataset and a machine learning task description and autonomously develop or improve an ML model. We are currently actively developing this platform and this benchmark is not fully implemented yet. In particular, we only include one agent and a subset of tasks for this benchmark.""")
1408
- # with gr.Row():
1409
- # with gr.Column(scale=2):
1410
- # Leaderboard(
1411
- # value=parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench'),
1412
- # select_columns=SelectColumns(
1413
- # default_selection=config.MLAGENTBENCH_ON_LOAD_COLUMNS + ["Verified"],
1414
- # cant_deselect=["Agent Name"],
1415
- # label="Select Columns to Display:",
1416
- # ),
1417
- # search_columns=config.MLAGENTBENCH_SEARCH_COLUMNS,
1418
- # hide_columns=config.MLAGENTBENCH_HIDE_COLUMNS,
1419
- # )
1420
- # gr.Markdown("""*95% CIs calculated using Student's t-distribution.*""", elem_classes=["text-right"])
1421
- # with gr.Row():
1422
- # scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench', aggregate=False), "Total Cost", "Overall Score", "Total Cost (in USD)", "Overall Score", ["Agent Name"]))
1423
-
1424
- # gr.Markdown("")
1425
- # gr.Markdown("")
1426
- # gr.Markdown("## Failure report for each agent")
1427
- # with gr.Row():
1428
- # with gr.Column(scale=1):
1429
- # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
1430
- # with gr.Row():
1431
- # with gr.Column(scale=1):
1432
- # failure_categories_overview = gr.Markdown()
1433
-
1434
- # with gr.Column(scale=1):
1435
- # failure_categories_chart = gr.Plot()
1436
-
1437
- # # Initialize the failure report agent dropdown with all agents
1438
- # demo.load(update_agent_dropdown,
1439
- # inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
1440
- # outputs=[failure_report_agent_dropdown])
1441
-
1442
- # # Update failure report when agent is selected
1443
- # failure_report_agent_dropdown.change(update_failure_report,
1444
- # inputs=[failure_report_agent_dropdown, gr.Textbox(value="mlagentbench", visible=False)],
1445
- # outputs=[failure_categories_overview, failure_categories_chart])
1446
-
1447
- # gr.Markdown("")
1448
- # gr.Markdown("")
1449
- # gr.Markdown("## Agent monitor")
1450
- # with gr.Row():
1451
- # with gr.Column(scale=1):
1452
- # agent_dropdown = gr.Dropdown(label="Select Agent")
1453
- # with gr.Column(scale=1):
1454
- # task_dropdown = gr.Dropdown(label="Select SWE-Bench Task")
1455
- # with gr.Row():
1456
- # task_overview = gr.Markdown()
1457
- # with gr.Row():
1458
- # flow_chart = gr.Plot(label="Task Flow")
1459
-
1460
- # # Initialize the agent dropdown with the best agent
1461
- # demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)], outputs=[agent_dropdown])
1462
- # demo.load(update_task_analysis, inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1463
-
1464
- # agent_dropdown.change(update_task_analysis,
1465
- # inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown],
1466
- # outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1467
- # task_dropdown.change(update_task_details,
1468
- # inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown, task_dropdown],
1469
- # outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1470
-
1471
-
1472
- # gr.Markdown("## Raw predictions")
1473
- # with gr.Row():
1474
- # with gr.Column(scale=1):
1475
- # raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1476
- # with gr.Column(scale=1):
1477
- # raw_task_dropdown = gr.Dropdown(label="Select Task")
1478
- # with gr.Column(scale=1):
1479
- # raw_step_dropdown = gr.Dropdown(label="Select Step")
1480
-
1481
- # with gr.Row():
1482
- # raw_call_details = gr.HTML()
1483
-
1484
- # def update_raw_task_dropdown(agent_name):
1485
- # analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1486
- # if not analyzed_traces:
1487
- # return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1488
- # task_ids = list(analyzed_traces.keys())
1489
- # steps = analyzed_traces[task_ids[0]]['steps']
1490
- # return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), update_raw_call_details(agent_name, task_ids[0], 0)
1491
-
1492
- # def update_raw_step_dropdown(agent_name, task_id):
1493
- # analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1494
- # if not analyzed_traces or task_id not in analyzed_traces:
1495
- # return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
1496
- # steps = analyzed_traces[task_id]['steps']
1497
- # return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1498
-
1499
- # def update_raw_call_details(agent_name, task_id, step_index):
1500
- # analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1501
- # if not analyzed_traces or task_id not in analyzed_traces:
1502
- # return "No data available for this selection."
1503
- # steps = analyzed_traces[task_id]['steps']
1504
- # if step_index is None:
1505
- # return "Invalid step selection."
1506
- # step = steps[step_index]
1507
- # return format_call_info(step, step_index)
1508
-
1509
- # # Initialize the raw agent dropdown with all agents
1510
- # demo.load(update_agent_dropdown,
1511
- # inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
1512
- # outputs=[raw_agent_dropdown])
1513
- # demo.load(update_raw_task_dropdown,
1514
- # inputs=[raw_agent_dropdown],
1515
- # outputs=[raw_task_dropdown, raw_step_dropdown])
1516
- # demo.load(update_raw_call_details,
1517
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1518
- # outputs=[raw_call_details])
1519
-
1520
- # raw_agent_dropdown.change(update_raw_task_dropdown,
1521
- # inputs=[raw_agent_dropdown],
1522
- # outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1523
- # raw_task_dropdown.change(update_raw_step_dropdown,
1524
- # inputs=[raw_agent_dropdown, raw_task_dropdown],
1525
- # outputs=[raw_step_dropdown, raw_call_details])
1526
- # raw_step_dropdown.change(update_raw_call_details,
1527
- # inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1528
- # outputs=[raw_call_details])
1529
-
1530
 
1531
  with gr.Tab("About"):
1532
  gr.Markdown((Path(__file__).parent / "about.md").read_text())
1533
 
1534
 
1535
- gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>
1536
- <p>Below we provide a guide on how to add an agent to the leaderboard:</p>""")
1537
  gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
1538
- gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>
1539
- <p>Below we provide a guide on how to add a benchmark to the leaderboard:</p>""")
1540
  gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
1541
- gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>
1542
- <p>Below we provide a guide on how to reproduce evaluations:</p>""")
1543
  gr.Markdown("""Coming soon...""")
1544
 
1545
 
 
391
  .user-type-links a {
392
  display: inline-block;
393
  padding: 5px 12px;
394
+ margin-bottom: 5px;
395
  background-color: #f0f4f8;
396
  color: #2c3e50 !important; /* Force the color change */
397
  text-decoration: none !important; /* Force remove underline */
 
1125
  outputs=[raw_step_dropdown, raw_call_details])
1126
  raw_step_dropdown.change(update_raw_call_details,
1127
  inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1128
+ outputs=[raw_call_details])
 
 
1129
 
1130
  with gr.Tab("About"):
1131
  gr.Markdown((Path(__file__).parent / "about.md").read_text())
1132
 
1133
 
1134
+ gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>""")
 
1135
  gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
1136
+ gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>""")
 
1137
  gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
1138
+ gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
 
1139
  gr.Markdown("""Coming soon...""")
1140
 
1141
 
benchmark_submission.md CHANGED
@@ -1,4 +1,4 @@
- To submit **a new benchmark **to the library:
+ To submit **a new benchmark** to the library:
 
  1. Implement a new benchmark using some standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each tasks as well as the task environment that is provided inside the container the agent is run in.
 