smolagents

community

Recent Activity

m-ric updated a Space about 7 hours ago
smolagents/smolagents_leaderboard
m-ric published a Space about 8 hours ago
smolagents/smolagents_leaderboard
m-ric updated a dataset about 10 hours ago
smolagents/answers

smolagents's activity

m-ric posted an update 4 days ago
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones 🔥

Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.

To make a research survey, you generally follow two steps, preparation (collect and organize papers) and writing (outline creation, writing, polishing). Researchers followed the same two steps and automated them.

🎯 For the preparation part, a key step is finding all the important references on the given subject.
Researchers first cast a wide net over all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an “AttributeTree” object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!

📝 For the writing part, the key was to get a synthesis that's both short and true. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose lists produced by LLMs, and RAG to grab original quotes instead of made-up ones.

As a result, their system outperforms previous approaches by far!

As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, with 4.59/5 vs 4.75/5 🏆

I advise you to read the paper, it's a great overview of the kind of assistants that we'll get in the near future! 👉 SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys 👉 http://www.surveyx.cn/
m-ric posted an update 10 days ago
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning! 🤯

Do we really need o1's huge RL procedure to see reasoning emerge? It seems not.
Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT, with no huge datasets or RL procedures needed.

Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data used by previous approaches.

⚡ The Less-is-More Reasoning Hypothesis:
‣ Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
‣ Pre-training knowledge plus sufficient compute at inference time is what levels up math skills

➡️ Core techniques:
‣ High-quality reasoning chains with self-verification steps
‣ 817 handpicked problems that encourage deeper reasoning
‣ Enough inference-time computation to allow extended reasoning

💪 Efficiency gains:
‣ Only 817 examples instead of 100k+
‣ 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data

This really challenges the notion that SFT leads to memorization rather than generalization! And it opens up reasoning to GPU-poor researchers 🚀

Read the full paper here 👉 LIMO: Less is More for Reasoning (2502.03387)
m-ric posted an update 14 days ago
Great feature alert: you can now share agents to the Hub! 🥳🥳

And any agent pushed to the Hub gets a cool Space interface to chat with it directly.

This was a real technical challenge: for instance, serializing tools to export them meant that you needed to get all the source code for a tool, verify that it was standalone (not relying on external variables), and gather all the packages required to make it run.
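
To give a feel for the "standalone" check mentioned above, here is a minimal sketch built on Python's standard `ast` module. It is not the smolagents implementation: the helper name `find_external_names` and both example tools are hypothetical, and the check is deliberately naive (it ignores nested scopes and statement order).

```python
import ast
import builtins


def find_external_names(source: str) -> set[str]:
    """Return names a function reads without defining, importing, or receiving them.

    Hypothetical helper for illustration only; not part of smolagents.
    """
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("source must start with a function definition")

    # Names the function owns: first, its parameters.
    args = func.args
    defined = {a.arg for a in args.posonlyargs + args.args + args.kwonlyargs}
    for extra in (args.vararg, args.kwarg):
        if extra is not None:
            defined.add(extra.arg)

    read = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            # Assigned names are owned by the function; other names are reads.
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                read.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # Imports made inside the body travel with the function.
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])

    # Whatever is read but neither owned nor a builtin must come from outside.
    return read - defined - set(dir(builtins))


standalone_tool = '''
def add_tax(price):
    import math
    rate = 0.2
    return math.ceil(price * (1 + rate))
'''

leaky_tool = '''
def add_tax(price):
    return price * (1 + TAX_RATE)
'''

print(find_external_names(standalone_tool))
print(find_external_names(leaky_tool))
```

The first tool imports what it needs inside its own body, so nothing is reported; the second reads `TAX_RATE` from the enclosing module, which is exactly the kind of dependency that would break once the tool is exported on its own.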

Go try it out! 👉 https://github.com/huggingface/smolagents
m-ric posted an update 14 days ago
For those who haven't come across it yet, here's a handy trick to discuss an entire GitHub repo with an LLM:

=> Just replace "github" with "gitingest" in the URL, and you get the whole repo as a single string that you can then paste into your LLM
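
The URL swap can be scripted in a few lines. This is a small sketch using only the standard library; the function name `to_gitingest_url` is made up for the example, and it assumes the repo URL lives on github.com so that the swap yields gitingest.com.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com, keeping the rest of the URL."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {github_url}")
    # Only the host changes; scheme, path, query and fragment stay as they were.
    return urlunsplit(parts._replace(netloc="gitingest.com"))


print(to_gitingest_url("https://github.com/huggingface/smolagents"))
```

Rewriting only the host, instead of doing a blind string replace, avoids touching a repo or owner name that happens to contain the word "github".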
m-ric posted an update 16 days ago
"2025 will be the year of AI agents": this statement has often been made; here are numbers to support it.

I've plotted the progress of AI agents on the GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.

And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.
m-ric posted an update 21 days ago
Adyen's new Data Agents Benchmark shows that DeepSeek-R1 struggles on data science tasks! ❌

➡️ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Research, OpenAI's o1 was by far the best model to power an agentic system.

So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.

👎 But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.

🧐 These results really surprised us. We checked them thoroughly: we even thought our APIs for DeepSeek were broken, and colleagues Leandro and Anton helped me start custom instances of R1 on our own H100s to make sure it worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.

It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to seeing people propose better agents! 🚀

Read more in the blog post 👉 https://huggingface.co./blog/dabstep
albertvillanova posted an update 24 days ago
🚀 Introducing @huggingface Open Deep-Research 💥

In just 24 hours, we built an open-source agent that:
✅ Autonomously browses the web
✅ Searches, scrolls & extracts info
✅ Downloads & manipulates files
✅ Runs calculations on data

55% on GAIA validation set! Help us improve it! 💡
https://huggingface.co./blog/open-deep-research
m-ric posted an update 24 days ago
Introducing open Deep-Research by Hugging Face! 💥

OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.

⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️

➡️ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...

We aimed for the best performance: are the agent's answers really rigorous?

On the GAIA benchmark, Deep Research had 67% accuracy on the validation set.
➡️ open Deep Research is at 55% (powered by o1). It is:
- the best pass@1 solution submitted
- the best open solution 💪💪

And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!

Read the blog post 👉 https://huggingface.co./blog/open-deep-research
m-ric posted an update 28 days ago
Now you can launch a code agent directly from your terminal!
✨ smolagent "Your task" directly launches a CodeAgent
▶️ This also works with web agents (replace smolagent with webagent) thanks to @merve!

💾 Another treat from smolagents release 1.7.0:
Agents now have a memory mechanism, enabling many possibilities like replaying the last run with agent.replay(). Thank you @clefourrier!

Check the release notes here 👉 https://github.com/huggingface/smolagents/releases/tag/v1.7.0
m-ric posted an update about 1 month ago
The Hub welcomes external inference providers!

✅ Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.

Check model cards on the Hub: you can now, in 1 click, use inference from various providers (cf. video demo)

Their inference can also be used through our Inference API client. There, you can use either your custom provider key or your HF token; with the HF token, billing is handled directly on your HF account, as a way to centralize all expenses.

💸 Also, PRO users get $2 in inference credits per month!

Read more in the announcement 👉 https://huggingface.co./blog/inference-providers
m-ric posted an update about 1 month ago
Today we make the biggest release in smolagents so far: we enable vision models, which lets you build powerful web browsing agents! 🥳

Our agents can now casually open up a web browser, and navigate on it by scrolling, clicking elements on the webpage, going back, just like a user would.

The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year."
Hi @mlabonne!

Go try it out, it's the most cracked agentic stuff I've seen in a while 🤯 (well, along with OpenAI's Operator, which beat us by one day)

For more detail, read our announcement blog 👉 https://huggingface.co./blog/smolagents-can-see
The code for the web browser example is here 👉 https://github.com/huggingface/smolagents/blob/main/examples/vlm_web_browser.py
m-ric posted an update about 1 month ago
MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length 💥

This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.

Key insights:

🏗️ MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers
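
To make the two numbers above concrete, here is a small illustrative sketch. It only restates the figures from the post (456B total, 45.9B activated, softmax attention every 8 layers); the function names, the 16-layer example depth, and the choice to put the softmax layer last in each group of 8 are assumptions for illustration, not details taken from the paper.

```python
TOTAL_PARAMS_B = 456.0   # total parameters, in billions (from the post)
ACTIVE_PARAMS_B = 45.9   # parameters activated per token, in billions (from the post)


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that is used for any single token."""
    return active_b / total_b


def attention_schedule(num_layers: int, softmax_every: int = 8) -> list[str]:
    """Label each layer 'lightning' or 'softmax'.

    Assumption: the softmax layer closes each group of `softmax_every` layers.
    """
    return [
        "softmax" if (index + 1) % softmax_every == 0 else "lightning"
        for index in range(num_layers)
    ]


print(f"{active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%} of parameters active per token")
print(attention_schedule(16))  # 16 layers is an arbitrary depth for the example
```

Under these assumptions roughly a tenth of the weights do work for each token, and seven out of every eight layers use the linear-complexity attention, which is where the long-context efficiency comes from.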

🏆 Outperforms leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)

🔬 Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high, generally utilization is around 50%)

🎯 Thorough training strategy:
‣ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!

Overall, not only is the model impressive, but the technical paper is also really interesting! 📝
It has lots of insights including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.

Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here (commercial use allowed under 100M monthly users) 👉 MiniMaxAI/MiniMax-Text-01
m-ric posted an update about 1 month ago
We've just released smolagents v1.3.0 🚀, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards! 📊

IMO, this interactive format makes big multi-step runs much easier to inspect than endless console logs.

Setup is very easy: it takes just a few lines of code.

Find a tutorial here 👉 https://huggingface.co./docs/smolagents/tutorials/inspect_runs