m-ric posted an update Jul 2
You don't need "function calling fine-tuning" to build good agents ⛔

It's trendy to share models "fine-tuned for function calling", but from my observations this fine-tuning is neither necessary nor sufficient to build good agent systems.
To name only a few:
๐Ÿฆโ€โฌ› Nexusflow/๐—ก๐—ฒ๐˜…๐˜‚๐˜€๐—ฅ๐—ฎ๐˜ƒ๐—ฒ๐—ป-๐—ฉ๐Ÿฎ-๐Ÿญ๐Ÿฏ๐—•
โŒ˜ CohereForAI/๐—ฐ๐Ÿฐ๐—ฎ๐—ถ-๐—ฐ๐—ผ๐—บ๐—บ๐—ฎ๐—ป๐—ฑ-๐—ฟ-๐—ฝ๐—น๐˜‚๐˜€
โ›ต๏ธ mistralai/๐— ๐—ถ๐˜…๐˜๐—ฟ๐—ฎ๐—น-๐Ÿด๐˜…๐Ÿฎ๐Ÿฎ๐—•-๐—œ๐—ป๐˜€๐˜๐—ฟ๐˜‚๐—ฐ๐˜-๐˜ƒ๐Ÿฌ.๐Ÿญ
"Fine-tuned for function-calling" generally means "fine-tuned to generate function calls in correct JSON for extremely simple tasks". In other terms, it means "improve the formatting of the tool calls".

Yet I discovered two things while improving Transformers Agents:
🧐 Even when used as JSON agents, these fine-tuned models don't perform very well
🏅 Good base models perform better without any fine-tuning, just plain prompting. (Llama-3-70B-Instruct, GPT-4o, Claude-3.5-Sonnet)

👇 The graph below shows the count of errors for my GPT-4o validation run on the GAIA benchmark: AgentParsingError and AgentExecutionError are the ones caused by incorrect formatting.
➤ As you can see, their count is already close to 0!
And given that GPT-4o is certainly not fine-tuned for our Code tool calling format, this shows that "function calling fine-tuning" is not necessary!
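As a rough illustration of a code-style tool-calling format (a simplified sketch, not the actual Transformers Agents implementation): the model writes a Python snippet that calls tools directly, and the agent executes it. The `search` tool and the generated action below are assumptions for illustration.

```python
# Simplified sketch of a code-style tool-calling format: the model generates
# a Python snippet that calls tools directly, and the agent executes it.

def search(query: str) -> str:
    """Hypothetical tool exposed to the model."""
    return f"results for {query!r}"

# What the model might generate as its action:
model_action = "answer = search('GAIA benchmark')"

namespace = {"search": search}
exec(model_action, namespace)  # a failure here would surface as an execution error
print(namespace["answer"])  # → results for 'GAIA benchmark'
```

No JSON schema is involved: any model that writes decent Python can use this format out of the box, which is why formatting errors are rare even without fine-tuning.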

The hardest thing to get right in an agent is still to plan good task-solving trajectories over several steps.
To improve this, we could:
- Use more powerful base models
- Make tool calling datasets with complex solving trajectories
- Use RL! cc @lvwerra
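The multi-step loop itself needs no special fine-tuning. A plain-prompting ReAct-style loop can be sketched as follows; the `CALL name arg` / `FINAL: answer` action format and the stub LLM are assumptions made purely for illustration, and any capable base model can fill that role.

```python
# Plain-prompting, multi-step agent loop (ReAct-style sketch).

def run_agent(task, call_llm, tools, max_steps=5):
    """Iterate thought -> action -> observation until a final answer."""
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = call_llm(history)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        _, name, arg = reply.split(" ", 2)  # parse "CALL name arg"
        history += f"{reply}\nObservation: {tools[name](arg)}\n"
    return None  # ran out of steps without an answer

# Stub LLM that "plans" a two-step trajectory, for demonstration only:
script = iter(["CALL search capital of France", "FINAL: Paris"])
answer = run_agent(
    "What is the capital of France?",
    call_llm=lambda history: next(script),
    tools={"search": lambda q: "Paris is the capital of France."},
)
print(answer)  # → Paris
```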

Perhaps using GPT-4o for evaluation is not the best way to do it?


It's not using GPT-4o for evaluation; evaluation is done with exact string match!
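For reference, exact-string-match scoring is as simple as normalizing both strings and comparing (a sketch; GAIA's actual normalization rules may differ):

```python
# Exact-string-match scoring: no LLM judge involved,
# just normalize the two answers and compare.

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

print(exact_match(" Paris ", "paris"))        # → True
print(exact_match("Paris, France", "Paris"))  # → False
```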