Spaces: Running on CPU Upgrade
Pratik Bhavsar committed · 712db9d
Parent(s): b9405c8
added more info
data_loader.py +9 -11
data_loader.py CHANGED

@@ -1044,12 +1044,9 @@ METHODOLOGY = """
 </div>
 
 
-<div></div>
-<h1 class="section-title">How Do We Measure Agent's Performance?</h1>
-<div>
-<p>The complexity of tool calling extends far beyond simple API invocations. We developed the Tool Selection Quality metric to assess agents' tool call performance, evaluating tool selection accuracy and effectiveness of parameter usage.</p>
-
 <div class="section-divider"></div>
+<h1 class="section-title">What Makes Tool Selection Hard?</h1>
+<div class="section-divider"></div>
 <h2 class="subsection-title">Scenario Recognition</h2>
 <div class="explanation-block">
 <p>When an agent encounters a query, it must first determine if tool usage is warranted. Information may already exist in the conversation history, making tool calls redundant. Alternatively, available tools might be insufficient or irrelevant to the task, requiring the agent to acknowledge limitations rather than force inappropriate tool usage.</p>
@@ -1084,10 +1081,10 @@ METHODOLOGY = """
 <li>Adapt to partial results or failures</li>
 </ul>
 </div>
-
-<div class="section-divider"></div>
-<
-<p class="code-intro">
+
+<div class="section-divider"></div>
+<h1 class="section-title">How Do We Measure Agent's Performance?</h1>
+<p class="code-intro">We developed the Tool Selection Quality metric to assess agents' tool call performance, evaluating tool selection accuracy and effectiveness of parameter usage. This is an example code for evaluating an LLM with a dataset with Galileo's Tool Selection Quality.</p>
 <div class="code-block">
 
 <pre>
@@ -1106,14 +1103,15 @@ evaluate_handler = pq.GalileoPromptCallback(
     scorers=[chainpoll_tool_selection_scorer],
 )
 
-llm = llm_handler.get_llm(model, temperature=0.0, max_tokens=4000)
+llm = llm_handler.get_llm(model, temperature=0.0, max_tokens=4000)  # llm_handler is a custom handler for LLMs
+
 system_msg = {
     "role": "system",
     "content": 'Your job is to use the given tools to answer the query of human. If there is no relevant tool then reply with "I cannot answer the question with given tools". If tool is available but sufficient information is not available, then ask human to get the same. You can call as many tools as you want. Use multiple tools if needed. If the tools need to be called in a sequence then just call the first tool.',
 }
 
 for row in df.itertuples():
-    chain = llm.bind_tools(tools)
+    chain = llm.bind_tools(tools)  # attach the tools
     outputs.append(
         chain.invoke(
             [system_msg, *row.conversation],
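The evaluation loop added in this commit sets up an LLM, binds tools to it, and invokes it once per conversation row. Below is a minimal self-contained sketch of that loop structure. `FakeLLM` is a hypothetical stand-in for `llm_handler.get_llm(...)`, and the tools and rows are dummies; the real code uses Galileo's promptquality callback and a pandas DataFrame, neither of which is shown here.

```python
from collections import namedtuple

# Hypothetical stand-in for llm_handler.get_llm(...); only the loop
# structure mirrors the diff, not any real LLM or Galileo API.
class FakeLLM:
    def __init__(self, model, temperature, max_tokens):
        self.model = model

    def bind_tools(self, tools):
        llm = self
        class Chain:
            def invoke(self, messages):
                # Record which tools were available and the last user turn,
                # instead of actually calling a model.
                return {
                    "model": llm.model,
                    "tools": [t["name"] for t in tools],
                    "last_user": messages[-1]["content"],
                }
        return Chain()

# Dummy tool specs and dataset rows (the real code reads a DataFrame).
tools = [{"name": "get_weather"}, {"name": "get_time"}]
Row = namedtuple("Row", ["conversation"])
df_rows = [
    Row([{"role": "user", "content": "What's the weather in Paris?"}]),
    Row([{"role": "user", "content": "What time is it in Tokyo?"}]),
]

system_msg = {"role": "system",
              "content": "Use the given tools to answer the query."}

llm = FakeLLM("some-model", temperature=0.0, max_tokens=4000)
outputs = []
for row in df_rows:                   # mirrors `for row in df.itertuples()`
    chain = llm.bind_tools(tools)     # attach the tools, as in the diff
    outputs.append(chain.invoke([system_msg, *row.conversation]))

print(len(outputs))         # 2
print(outputs[0]["tools"])  # ['get_weather', 'get_time']
```

The system message plus the unpacked conversation (`[system_msg, *row.conversation]`) is what each invocation receives, so every row is scored independently against the same tool set.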