Spaces: Running on CPU Upgrade
Pratik Bhavsar committed · 712db9d
Parent(s): b9405c8
added more info
data_loader.py +9 -11
data_loader.py CHANGED

@@ -1044,12 +1044,9 @@ METHODOLOGY = """
 </div>
 
 
-<div></div>
-<h1 class="section-title">How Do We Measure Agent's Performance?</h1>
-<div>
-<p>The complexity of tool calling extends far beyond simple API invocations. We developed the Tool Selection Quality metric to assess agents' tool call performance, evaluating tool selection accuracy and effectiveness of parameter usage.</p>
-
 <div class="section-divider"></div>
+<h1 class="section-title">What Makes Tool Selection Hard?</h1>
+<div class="section-divider"></div>
 <h2 class="subsection-title">Scenario Recognition</h2>
 <div class="explanation-block">
 <p>When an agent encounters a query, it must first determine if tool usage is warranted. Information may already exist in the conversation history, making tool calls redundant. Alternatively, available tools might be insufficient or irrelevant to the task, requiring the agent to acknowledge limitations rather than force inappropriate tool usage.</p>
@@ -1084,10 +1081,10 @@ METHODOLOGY = """
 <li>Adapt to partial results or failures</li>
 </ul>
 </div>
-
-<div class="section-divider"></div>
-<
-<p class="code-intro">
+
+<div class="section-divider"></div>
+<h1 class="section-title">How Do We Measure Agent's Performance?</h1>
+<p class="code-intro">We developed the Tool Selection Quality metric to assess agents' tool call performance, evaluating tool selection accuracy and effectiveness of parameter usage. This is an example code for evaluating an LLM with a dataset with Galileo's Tool Selection Quality.</p>
 <div class="code-block">
 
 <pre>
@@ -1106,14 +1103,15 @@ evaluate_handler = pq.GalileoPromptCallback(
     scorers=[chainpoll_tool_selection_scorer],
 )
 
-llm = llm_handler.get_llm(model, temperature=0.0, max_tokens=4000)
+llm = llm_handler.get_llm(model, temperature=0.0, max_tokens=4000)  # llm_handler is a custom handler for LLMs
+
 system_msg = {
     "role": "system",
     "content": 'Your job is to use the given tools to answer the query of human. If there is no relevant tool then reply with "I cannot answer the question with given tools". If tool is available but sufficient information is not available, then ask human to get the same. You can call as many tools as you want. Use multiple tools if needed. If the tools need to be called in a sequence then just call the first tool.',
 }
 
 for row in df.itertuples():
-    chain = llm.bind_tools(tools)
+    chain = llm.bind_tools(tools)  # attach the tools
     outputs.append(
         chain.invoke(
             [system_msg, *row.conversation],
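The evaluation loop added in this commit sets up an LLM, binds tools to it, and invokes it once per conversation row. Below is a minimal self-contained sketch of that loop structure. `FakeLLM` is a hypothetical stand-in for `llm_handler.get_llm(...)`, and the tools and rows are dummies; the real code uses Galileo's promptquality callback and a pandas DataFrame, neither of which is shown here.

```python
from collections import namedtuple

# Hypothetical stand-in for llm_handler.get_llm(...); only the loop
# structure mirrors the diff, not any real LLM or Galileo API.
class FakeLLM:
    def __init__(self, model, temperature, max_tokens):
        self.model = model

    def bind_tools(self, tools):
        llm = self
        class Chain:
            def invoke(self, messages):
                # Record which tools were available and the last user turn,
                # instead of actually calling a model.
                return {
                    "model": llm.model,
                    "tools": [t["name"] for t in tools],
                    "last_user": messages[-1]["content"],
                }
        return Chain()

# Dummy tool specs and dataset rows (the real code reads a DataFrame).
tools = [{"name": "get_weather"}, {"name": "get_time"}]
Row = namedtuple("Row", ["conversation"])
df_rows = [
    Row([{"role": "user", "content": "What's the weather in Paris?"}]),
    Row([{"role": "user", "content": "What time is it in Tokyo?"}]),
]

system_msg = {"role": "system",
              "content": "Use the given tools to answer the query."}

llm = FakeLLM("some-model", temperature=0.0, max_tokens=4000)
outputs = []
for row in df_rows:                   # mirrors `for row in df.itertuples()`
    chain = llm.bind_tools(tools)     # attach the tools, as in the diff
    outputs.append(chain.invoke([system_msg, *row.conversation]))

print(len(outputs))         # 2
print(outputs[0]["tools"])  # ['get_weather', 'get_time']
```

The system message plus the unpacked conversation (`[system_msg, *row.conversation]`) is what each invocation receives, so every row is scored independently against the same tool set.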