Space: ServiceNow/browsergym-leaderboard
jardinet-souffleton committed 3 days ago

Commit d2de681 (1 parent: 6d59540)

Add README and workarena-l1.json for GenericAgent-o1-mini and GenericAgent-o3-mini

Files changed (4)

results/GenericAgent-o1-mini/README.md ADDED

+### GenericAgent-o1-mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses o1-mini as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/agent_configs.py):
+```python
+BASE_FLAGS = FLAGS_GPT_4o = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=False,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        action_set=bgym.HighLevelActionSetArgs(
+            subsets=["bid"],
+            multiaction=False,
+        ),
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-o1-mini/workarena-l1.json ADDED

+[
+    {
+        "agent_name": "GenericAgent-o1-mini",
+        "study_id": "f3e1fcb8-5fc5-4115-9e00-27251508e2c7",
+        "date_time": "2025-02-07 14:00:00",
+        "benchmark": "WorkArena-L1",
+        "score": 51.8,
+        "std_err": 2.80,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Additional details",
+        "original_or_reproduced": "Reproduced"
+    }
+]
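
Reviewer note: a result file like the one above can be sanity-checked before merging. The sketch below is only an illustration. The field list and types are inferred from this commit's JSON, not from any schema published by the leaderboard, and `validate_entry` / `validate_results` are hypothetical helper names.

```python
import json

# Field names and types inferred from the workarena-l1.json in this commit
# (an assumption; the leaderboard's real schema check is not shown here).
REQUIRED_FIELDS = {
    "agent_name": str,
    "study_id": str,
    "date_time": str,
    "benchmark": str,
    "score": (int, float),
    "std_err": (int, float),
    "benchmark_specific": str,
    "benchmark_tuned": str,
    "followed_evaluation_protocol": str,
    "reproducible": str,
    "comments": str,
    "original_or_reproduced": str,
}


def validate_entry(entry):
    """Return a list of problems found in one result entry (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"wrong type for {name}")
    return problems


def validate_results(text):
    """Parse a results file (a JSON list of entries) and collect all problems."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top level must be a JSON list"]
    problems = []
    for i, entry in enumerate(entries):
        problems.extend(f"entry {i}: {p}" for p in validate_entry(entry))
    return problems
```

Calling `validate_results` on the text of a results file returns an empty list when every entry carries all twelve fields with the expected types.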

results/GenericAgent-o3-mini/README.md ADDED

+### GenericAgent-o3-mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses o3-mini as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/agent_configs.py):
+```python
+BASE_FLAGS = FLAGS_GPT_4o = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=False,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        action_set=bgym.HighLevelActionSetArgs(
+            subsets=["bid"],
+            multiaction=False,
+        ),
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-o3-mini/workarena-l1.json ADDED

+[
+    {
+        "agent_name": "GenericAgent-o3-mini",
+        "study_id": "f3e1fcb8-5fc5-4115-9e00-27251508e2c7",
+        "date_time": "2025-02-07 14:00:00",
+        "benchmark": "WorkArena-L1",
+        "score": 48.2,
+        "std_err": 2.80,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Additional details",
+        "original_or_reproduced": "Original"
+    }
+]
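
Reviewer note: the two WorkArena-L1 scores in this commit (51.8 and 48.2, each with `std_err` 2.80) can be compared with a rough back-of-the-envelope check. This is a minimal sketch that assumes the two runs are independent and that the reported standard errors can be combined in quadrature. Neither assumption is stated in the commit, and `score_gap` is a hypothetical helper name.

```python
import math


def score_gap(score_a, se_a, score_b, se_b):
    """Return (gap, standard error of the gap, gap in SE units).

    Assumes the two scores come from independent runs, so their
    standard errors add in quadrature.
    """
    diff = score_a - score_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, diff / se_diff


# Values taken from the two workarena-l1.json files in this commit.
diff, se_diff, z = score_gap(51.8, 2.80, 48.2, 2.80)
print(f"gap = {diff:.1f} points, SE of gap = {se_diff:.2f}, z = {z:.2f}")
# prints: gap = 3.6 points, SE of gap = 3.96, z = 0.91
```

Under these assumptions the 3.6-point gap is under one standard error of the difference, so the ordering of the two agents on this benchmark should be read with caution.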