Hugging Test Lab

company
Activity Feed

AI & ML interests

Offering a great user experience for organizations on Hugging Face!

Recent Activity

HF-test-lab's activity

alvarobarttย 
posted an update 3 days ago
view post
Post
2600
๐Ÿ”ฅ Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn the spatial-temporal grounding and planning
- A strong generalization and ability to be fine-tuned for other agentic tasks
- SOTA in different multi-modal benchmarks spanning across UI navigation, robotics manipulation, image / video understanding and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)
m-ricย 
posted an update 4 days ago
view post
Post
4107
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones ๐Ÿ”ฅ

Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.

To make a research survey, you generally follow two steps, preparation (collect and organize papers) and writing (outline creation, writing, polishing). Researchers followed the same two steps and automated them.

๐ŸŽฏ For the preparation part, a key part is find all the important references on the given subject.
Researchers first cast a wide net of all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an โ€œAttributeTreeโ€ object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!

๐Ÿ“ For the writing part, key was to get a synthesis that's both short and true. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the too verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.

As a result, their system outperforms previous approaches by far!

As assessed by LLM-judges, the quality score os SurveyX even approaches this of human experts, with 4.59/5 vs 4.75/5 ๐Ÿ†

I advise you to read the paper, it's a great overview of the kind of assistants that we'll get in the short future! ๐Ÿ‘‰ SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys ๐Ÿ‘‰ http://www.surveyx.cn/
m-ricย 
posted an update 10 days ago
view post
Post
2908
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning! ๐Ÿคฏ

Do we really need o1's huge RL procedure to see reasoning emerge? It seems not.
Researchers from Shanghai Jiaotong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT โ€”no huge datasets or RL procedures needed.

Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data in previous approaches.

โšก The Less-is-More Reasoning Hypothesis:
โ€ฃ Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
โ€ฃ Pre-training knowledge plus sufficient computational resources at inference levels up math skills

โžก๏ธ Core techniques:
โ€ฃ High-quality reasoning chains with self-verification steps
โ€ฃ 817 handpicked problems that encourage deeper reasoning
โ€ฃ Enough inference-time computation to allow extended reasoning

๐Ÿ’ช Efficiency gains:
โ€ฃ Only 817 examples instead of 100k+
โ€ฃ 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data

This really challenges the notion that SFT leads to memorization rather than generalization! And opens up reasoning to GPU-poor researchers ๐Ÿš€

Read the full paper here ๐Ÿ‘‰ย  LIMO: Less is More for Reasoning (2502.03387)
m-ricย 
posted an update 14 days ago
view post
Post
2818
๐—š๐—ฟ๐—ฒ๐—ฎ๐˜ ๐—ณ๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฎ๐—น๐—ฒ๐—ฟ๐˜: you can now share agents to the Hub! ๐Ÿฅณ๐Ÿฅณ

And any agent pushed to Hub get a cool Space interface to directly chat with it.

This was a real technical challenge: for instance, serializing tools to export them meant that you needed to get all the source code for a tool, verify that it was standalone (not relying on external variables), and gathering all the packages required to make it run.

Go try it out! ๐Ÿ‘‰ https://github.com/huggingface/smolagents
  • 2 replies
ยท
m-ricย 
posted an update 14 days ago
view post
Post
2452
For those who haven't come across it yet, here's a handy trick to discuss an entire GitHub repo with an LLM:

=> Just replace "github" with "gitingest" in the url, and you get the whole repo as a single string that you can then paste in your LLMs
m-ricย 
posted an update 16 days ago
view post
Post
4771
"๐Ÿฎ๐Ÿฌ๐Ÿฎ๐Ÿฑ ๐˜„๐—ถ๐—น๐—น ๐—ฏ๐—ฒ ๐˜๐—ต๐—ฒ ๐˜†๐—ฒ๐—ฎ๐—ฟ ๐—ผ๐—ณ ๐—”๐—œ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€": this statement has often been made, here are numbers to support it.

I've plotted the progress of AI agents on GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.

And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.
m-ricย 
posted an update 21 days ago
view post
Post
3697
๐—”๐—ฑ๐˜†๐—ฒ๐—ป'๐˜€ ๐—ป๐—ฒ๐˜„ ๐——๐—ฎ๐˜๐—ฎ ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—•๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐——๐—ฒ๐—ฒ๐—ฝ๐—ฆ๐—ฒ๐—ฒ๐—ธ-๐—ฅ๐Ÿญ ๐˜€๐˜๐—ฟ๐˜‚๐—ด๐—ด๐—น๐—ฒ๐˜€ ๐—ผ๐—ป ๐—ฑ๐—ฎ๐˜๐—ฎ ๐˜€๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐˜๐—ฎ๐˜€๐—ธ๐˜€! โŒ

โžก๏ธ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system.

So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.

๐Ÿ‘Ž But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.

๐Ÿง These results really surprised us. We thoroughly checked them, we even thought our APIs for DeepSeek were broken and colleagues Leandro Anton helped me start custom instances of R1 on our own H100s to make sure it worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.

It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to see people proposing better agents! ๐Ÿš€

Read more in the blog post ๐Ÿ‘‰ https://huggingface.co./blog/dabstep
m-ricย 
posted an update 24 days ago
view post
Post
9664
Introducing ๐—ผ๐—ฝ๐—ฒ๐—ป ๐——๐—ฒ๐—ฒ๐—ฝ-๐—ฅ๐—ฒ๐˜€๐—ฒ๐—ฎ๐—ฟ๐—ฐ๐—ต by Hugging Face! ๐Ÿ’ฅ

OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.

โฑ๏ธ So with a team of cracked colleagues, we set ourselves a 24hours deadline to replicate and open-source Deep Research! โฑ๏ธ

โžก๏ธ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculation on data...

We aimed for the best performance: are the agent's answers really rigorous?

On GAIA benchmark, Deep Research had 67% accuracy on the validation set.
โžก๏ธ open Deep Research is at 55% (powered by o1), it is:
- the best pass@1 solution submitted
- the best open solution ๐Ÿ’ช๐Ÿ’ช

And it's only getting started ! Please jump in, drop PRs, and let's bring it to the top !

Read the blog post ๐Ÿ‘‰ https://huggingface.co./blog/open-deep-research
m-ricย 
posted an update 28 days ago
view post
Post
3105
Now you can launch a code agent directly from your terminal!
โœจ ๐šœ๐š–๐š˜๐š•๐šŠ๐š๐šŽ๐š—๐š "๐šˆ๐š˜๐šž๐š› ๐š๐šŠ๐šœ๐š”" directly launches a CodeAgent
โ–ถ๏ธ This also works with web agents (replace ๐šœ๐š–๐š˜๐š•๐šŠ๐š๐šŽ๐š—๐š with ๐š ๐šŽ๐š‹๐šŠ๐š๐šŽ๐š—๐š) thanks to @merve !

๐Ÿ’พ Another treat from smolagents release 1.7.0:
Now agents have a memory mechanism, enabling many possibilities like replaying the last run with ๐šŠ๐š๐šŽ๐š—๐š.๐š›๐šŽ๐š™๐š•๐šŠ๐šข(), thank you @clefourrier !

Check the release notes here ๐Ÿ‘‰ https://github.com/huggingface/smolagents/releases/tag/v1.7.0
pagezyhfย 
posted an update 29 days ago
view post
Post
1685
We published https://huggingface.co./blog/deepseek-r1-aws!

If you are using AWS, give a read. It is a running document to showcase how to deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.

We're working hard to enable all the scenarios, whether you want to deploy to Inference Endpoints, Sagemaker or EC2; with GPUs or with Trainium & Inferentia.

We have full support for the distilled models, DeepSeek-R1 support is coming soon!! I'll keep you posted.

Cheers
  • 1 reply
ยท
m-ricย 
posted an update about 1 month ago
view post
Post
4049
๐—ง๐—ต๐—ฒ ๐—›๐˜‚๐—ฏ ๐˜„๐—ฒ๐—น๐—ฐ๐—ผ๐—บ๐—ฒ๐˜€ ๐—ฒ๐˜…๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐—ถ๐—ป๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ถ๐—ฑ๐—ฒ๐—ฟ๐˜€!

โœ… Hosting our own inference was not enough: now the Hub 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.

Check model cards on the Hub: you can now, in 1 click, use inference from various providers (cf video demo)

Their inference can also be used through our Inference API client. There, you can use either your custom provider key, or your HF token, then billing will be handled directly on your HF account, as a way to centralize all expenses.

๐Ÿ’ธ Also, PRO users get 2$ inference credits per month!

Read more in the announcement ๐Ÿ‘‰ https://huggingface.co./blog/inference-providers
  • 1 reply
ยท
m-ricย 
posted an update about 1 month ago
view post
Post
3265
Today we make the biggest release in smolagents so far: ๐˜„๐—ฒ ๐—ฒ๐—ป๐—ฎ๐—ฏ๐—น๐—ฒ ๐˜ƒ๐—ถ๐˜€๐—ถ๐—ผ๐—ป ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€, ๐˜„๐—ต๐—ถ๐—ฐ๐—ต ๐—ฎ๐—น๐—น๐—ผ๐˜„๐˜€ ๐˜๐—ผ ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ๐—ณ๐˜‚๐—น ๐˜„๐—ฒ๐—ฏ ๐—ฏ๐—ฟ๐—ผ๐˜„๐˜€๐—ถ๐—ป๐—ด ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€! ๐Ÿฅณ

Our agents can now casually open up a web browser, and navigate on it by scrolling, clicking elements on the webpage, going back, just like a user would.

The demo below shows Claude-3.5-Sonnet browsing GitHub for task: "Find how many commits the author of the current top trending repo did over last year."
Hi @mlabonne !

Go try it out, it's the most cracked agentic stuff I've seen in a while ๐Ÿคฏ (well, along with OpenAI's Operator who beat us by one day)

For more detail, read our announcement blog ๐Ÿ‘‰ https://huggingface.co./blog/smolagents-can-see
The code for the web browser example is here ๐Ÿ‘‰ https://github.com/huggingface/smolagents/blob/main/examples/vlm_web_browser.py
ยท
ariG23498ย 
posted an update about 1 month ago
florentgbelidjiย 
posted an update about 1 month ago
view post
Post
1484
๐—ฃ๐—น๐—ฎ๐—ป๐—ป๐—ถ๐—ป๐—ด ๐—ฌ๐—ผ๐˜‚๐—ฟ ๐—ก๐—ฒ๐˜…๐˜ ๐—ฆ๐—ธ๐—ถ ๐—”๐—ฑ๐˜ƒ๐—ฒ๐—ป๐˜๐˜‚๐—ฟ๐—ฒ ๐—๐˜‚๐˜€๐˜ ๐—š๐—ผ๐˜ ๐—ฆ๐—บ๐—ฎ๐—ฟ๐˜๐—ฒ๐—ฟ: ๐—œ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ถ๐—ป๐—ด ๐—”๐—น๐—ฝ๐—ถ๐—ป๐—ฒ ๐—”๐—ด๐—ฒ๐—ป๐˜!๐Ÿ”๏ธโ›ท๏ธ

With the big hype around AI agents these days, I couldnโ€™t stop thinking about how AI agents could truly enhance real-world activities.
What sort of applications could we build with those AI agents: agentic RAG? self-correcting text-to-sql? Nah, boringโ€ฆ

Passionate about outdoors, Iโ€™ve always dreamed of a tool that could simplify planning mountain trips while accounting for all potential risks. Thatโ€™s why I built ๐—”๐—น๐—ฝ๐—ถ๐—ป๐—ฒ ๐—”๐—ด๐—ฒ๐—ป๐˜, a smart assistant designed to help you plan safe and enjoyable itineraries in the French Alps and Pyrenees.

Built using Hugging Face's ๐˜€๐—บ๐—ผ๐—น๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ library, Alpine Agent combines the power of AI with trusted resources like ๐˜š๐˜ฌ๐˜ช๐˜ต๐˜ฐ๐˜ถ๐˜ณ.๐˜ง๐˜ณ (https://skitour.fr/) and METEO FRANCE. Whether itโ€™s suggesting a route with moderate difficulty or analyzing avalanche risks and weather conditions, this agent dynamically integrates data to deliver personalized recommendations.

In my latest blog post, I share how I developed this projectโ€”from defining tools and integrating APIs to selecting the best LLMs like ๐˜˜๐˜ธ๐˜ฆ๐˜ฏ2.5-๐˜Š๐˜ฐ๐˜ฅ๐˜ฆ๐˜ณ-32๐˜‰-๐˜๐˜ฏ๐˜ด๐˜ต๐˜ณ๐˜ถ๐˜ค๐˜ต, ๐˜“๐˜ญ๐˜ข๐˜ฎ๐˜ข-3.3-70๐˜‰-๐˜๐˜ฏ๐˜ด๐˜ต๐˜ณ๐˜ถ๐˜ค๐˜ต, or ๐˜Ž๐˜—๐˜›-4.

โ›ท๏ธ Curious how AI can enhance adventure planning?โ€จTry the app and share your thoughts: florentgbelidji/alpine-agent

๐Ÿ‘‰ Want to build your own agents? Whether for cooking, sports training, or other passions, the possibilities are endless. Check out the blog post to learn more: https://huggingface.co./blog/florentgbelidji/alpine-agent

Many thanks to @m-ric for helping on building this tool with smolagents!
  • 1 reply
ยท
ariG23498ย 
posted an update about 1 month ago
m-ricย 
posted an update about 1 month ago
view post
Post
1369
๐— ๐—ถ๐—ป๐—ถ๐— ๐—ฎ๐˜…'๐˜€ ๐—ป๐—ฒ๐˜„ ๐— ๐—ผ๐—˜ ๐—Ÿ๐—Ÿ๐—  ๐—ฟ๐—ฒ๐—ฎ๐—ฐ๐—ต๐—ฒ๐˜€ ๐—–๐—น๐—ฎ๐˜‚๐—ฑ๐—ฒ-๐—ฆ๐—ผ๐—ป๐—ป๐—ฒ๐˜ ๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐˜„๐—ถ๐˜๐—ต ๐Ÿฐ๐—  ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐˜€ ๐—ฐ๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ ๐—น๐—ฒ๐—ป๐—ด๐˜๐—ต ๐Ÿ’ฅ

This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.

๐—ž๐—ฒ๐˜† ๐—ถ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€:

๐Ÿ—๏ธ MoE with novel hybrid attention:
โ€ฃ Mixture of Experts with 456B total parameters (45.9B activated per token)
โ€ฃ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers

๐Ÿ† Outperforms leading models across benchmarks while offering vastly longer context:
โ€ฃ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
โ€ฃ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)

๐Ÿ”ฌ Technical innovations enable efficient scaling:
โ€ฃ Novel expert parallel and tensor parallel strategies cut communication overhead in half
โ€ฃ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high, generally utilization is around 50%)

๐ŸŽฏ Thorough training strategy:
โ€ฃ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!

Overall, not only is the model impressive, but the technical paper is also really interesting! ๐Ÿ“
It has lots of insights including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.

Read it in full here ๐Ÿ‘‰ MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here, allows commercial use <100M monthly users ๐Ÿ‘‰ MiniMaxAI/MiniMax-Text-01
m-ricย 
posted an update about 1 month ago
view post
Post
2541
๐—ช๐—ฒ'๐˜ƒ๐—ฒ ๐—ท๐˜‚๐˜€๐˜ ๐—ฟ๐—ฒ๐—น๐—ฒ๐—ฎ๐˜€๐—ฒ๐—ฑ ๐˜€๐—บ๐—ผ๐—น๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐˜ƒ๐Ÿญ.๐Ÿฏ.๐Ÿฌ ๐Ÿš€, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards! ๐Ÿ“Š

This interactive format is IMO much easier to inspect big multi-step runs than endless console logs.

The setup is very easy, in a few lines of code.

Find a tutorial here ๐Ÿ‘‰ https://huggingface.co./docs/smolagents/tutorials/inspect_runs
  • 5 replies
ยท
MoritzLaurerย 
posted an update about 1 month ago
view post
Post
2506
Microsoft's rStar-Math paper claims that ๐Ÿค ~7B models can match the math skills of o1 using clever train- and test-time techniques. You can now download their prompt templates from Hugging Face !

๐Ÿ“ The paper introduces rStar-Math, which claims to rival OpenAI o1's math reasoning capabilities by integrating Monte Carlo Tree Search (MCTS) with step-by-step verified reasoning trajectories.
๐Ÿค– A Process Preference Model (PPM) enables fine-grained evaluation of intermediate steps, improving training data quality.
๐Ÿงช The system underwent four rounds of self-evolution, progressively refining both the policy and reward models to tackle Olympiad-level math problemsโ€”without GPT-4-based data distillation.
๐Ÿ’พ While we wait for the release of code and datasets, you can already download the prompts they used from the HF Hub!

Details and links here ๐Ÿ‘‡
Prompt-templates docs: https://moritzlaurer.github.io/prompt_templates/
Templates on the hub: MoritzLaurer/rstar-math-prompts
Prompt-templates collection: MoritzLaurer/prompt-templates-6776aa0b0b8a923957920bb4
Paper: https://arxiv.org/pdf/2501.04519
pagezyhfย 
posted an update about 2 months ago