You don't need "function calling fine-tuning" to build good agents ❌
It's trendy to share models "fine-tuned for function calling", but from my observations, this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
🐦‍⬛ Nexusflow/NexusRaven-V2-13B
⌘ CohereForAI/c4ai-command-r-plus
⛵️ mistralai/Mixtral-8x22B-Instruct-v0.1
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".
Yet I discovered two things while improving Transformers Agents:
🚧 Even when used as JSON agents, these fine-tuned models don't perform very well
👍 Good base models perform better without any fine-tuning, just plain prompting (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)
📊 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: AgentParsingError and AgentExecutionError are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!
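For illustration, here is a minimal sketch of what "plain prompting" for tool calls can look like with an OpenAI-style client; the tool spec and prompt wording are my assumptions, not the setup from the run above. The point is that the tool format is just text in the system prompt:

```python
# Tool calling via plain prompting: no function-calling fine-tune,
# the tool spec is ordinary text in the system prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You can use one tool: search(query: str) -> str.\n"
    "To call it, reply with ONLY this JSON: "
    '{"tool": "search", "arguments": {"query": "..."}}'
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Who won the 2023 Ballon d'Or?"},
    ],
)
call = json.loads(resp.choices[0].message.content)
print(call["tool"], call["arguments"])
```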
The hardest thing to get right in an agent is still to plan good task-solving trajectories over several steps; a minimal sketch of such a loop follows the list below.
To improve this, we could:
- Use more powerful base models
- Build tool-calling datasets with complex solving trajectories
- Use RL! cc @lvwerra
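To show what "planning trajectories over several steps" actually demands, here is a minimal sketch of a multi-step agent loop; the `llm` helper and the JSON step format are assumptions for illustration, not the Transformers Agents API. The hard part that a formatting fine-tune never touches is the control flow: picking the right tool at each step and deciding when to stop.

```python
# Minimal multi-step agent loop (illustrative, not a library API).
import json

def run_agent(task, tools, llm, max_steps=5):
    """tools: dict of name -> callable; llm: callable(messages) -> str."""
    messages = [
        {"role": "system", "content": (
            "Solve the task step by step. Each turn, reply with ONLY JSON: "
            '{"tool": "<name>", "arguments": {...}} or {"final_answer": "..."}. '
            f"Available tools: {list(tools)}"
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = llm(messages)  # one reason-then-act step
        messages.append({"role": "assistant", "content": reply})
        step = json.loads(reply)
        if "final_answer" in step:  # the trajectory is complete
            return step["final_answer"]
        # Execute the chosen tool and feed the observation back in.
        observation = tools[step["tool"]](**step["arguments"])
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped: max steps reached."
```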