Nerdy Face
AI & ML interests: None defined yet.
nerdyface's activity
Post
Today we make the biggest release in smolagents so far: we enable vision models, which allows you to build powerful web browsing agents!
Our agents can now casually open up a web browser and navigate it by scrolling, clicking elements on the page, and going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year."
Go try it out, it's the most cracked agentic stuff I've seen in a while (well, along with OpenAI's Operator, which beat us by one day).
For more detail, read our announcement blog: https://huggingface.co./blog/smolagents-can-see
The code for the web browser example is here: https://github.com/huggingface/smolagents/blob/main/examples/vlm_web_browser.py
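For a taste of what driving the browser agent looks like, here is a minimal sketch, not the full example: the real vlm_web_browser.py script linked above wires in helium/selenium browser tools and a screenshot callback, and the exact argument names may differ across smolagents versions.

```python
# Minimal sketch of running the web-browser agent with a vision-capable model.
# The linked vlm_web_browser.py adds browser tools (click, scroll, go_back, ...)
# and a screenshot callback; names here follow my reading of the smolagents API.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(model_id="anthropic/claude-3-5-sonnet-latest")  # any vision-capable chat model

agent = CodeAgent(
    tools=[],                                  # the full example passes in browser-control tools
    model=model,
    additional_authorized_imports=["helium"],  # lets generated code steer the browser via helium
)

agent.run(
    "Find how many commits the author of the current top trending repo did over last year."
)
```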
Post
Warmup -> stable -> decay learning rate scheduler:
Use the stable-phase checkpoints to continue training the model on any new dataset without training spikes (minimal schedule sketch below)!
JingzeShi/Doge-20M-checkpoint
JingzeShi/Doge-60M-checkpoint
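For reference, here is a minimal version of that schedule written as a PyTorch LambdaLR; the phase lengths are illustrative, not the ones used for the Doge checkpoints.

```python
# Warmup -> stable -> decay (WSD) schedule as a LambdaLR multiplier.
# Resume from a stable-phase checkpoint on a new dataset and only the final
# decay phase needs to be redone; phase lengths below are illustrative.
import torch

def wsd_lambda(step, warmup=1000, stable=8000, decay=1000, min_ratio=0.1):
    if step < warmup:                          # linear warmup from 0 to peak LR
        return step / max(1, warmup)
    if step < warmup + stable:                 # hold at peak LR
        return 1.0
    t = min(1.0, (step - warmup - stable) / max(1, decay))
    return 1.0 - (1.0 - min_ratio) * t         # linear decay down to min_ratio * peak

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, wsd_lambda)
```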
Post
R1 is out! And with a lot of other R1-related models...
Post
Pre-training a model on just a single RTX 4090 is really slow, even for small language models! (JingzeShi/doge-slm-677fd879f8c4fd0f43e05458)
Post
We now have more than 2,000 public AI models using ModelHubMixin!
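For anyone who hasn't tried it, this is roughly what the PyTorch flavor looks like; a minimal sketch, with a placeholder repo id.

```python
# Subclassing PyTorchModelHubMixin adds from_pretrained / save_pretrained /
# push_to_hub to any nn.Module, which is what these 2,000+ public models use.
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class TinyClassifier(nn.Module, PyTorchModelHubMixin):
    def __init__(self, hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
# model.push_to_hub("your-username/tiny-classifier")                  # placeholder repo id
# model = TinyClassifier.from_pretrained("your-username/tiny-classifier")
```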
Post
Hey there folks,
Facebook AI just released JASCO models that make music stems.
You can try it out here: Tonic/audiocraft
Hope you like it!
Post
MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length!
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
Key insights:
MoE with novel hybrid attention:
- Mixture of Experts with 456B total parameters (45.9B activated per token)
- Combines lightning attention (linear complexity) for most layers with traditional softmax attention every 8 layers (see the layer-pattern sketch at the end of this post)
Outperforms leading models across benchmarks while offering vastly longer context:
- Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
- Can efficiently handle 4M-token contexts (vs 256K for most other LLMs)
Technical innovations enable efficient scaling:
- Novel expert-parallel and tensor-parallel strategies cut communication overhead in half
- Improved linear attention sequence parallelism, multi-level padding, and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
Thorough training strategy:
- Careful data curation and quality control, using a smaller preliminary version of their LLM as a judge!
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights, including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.
Read it in full here: MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here (commercial use allowed under 100M monthly users): MiniMaxAI/MiniMax-Text-01
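To make the hybrid attention pattern concrete, here is an illustrative toy sketch of my own, not MiniMax's code: roughly 1 in 8 layers keeps full softmax attention, the rest use linear-complexity lightning attention, and only about 10% of the 456B parameters (45.9B) are activated per token. The layer count below is arbitrary.

```python
# Toy illustration of the hybrid layer pattern described above (not MiniMax's code).
def attention_schedule(num_layers: int = 80, softmax_every: int = 8):
    """Lightning (linear) attention everywhere, except every 8th layer keeps softmax."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "lightning" for i in range(num_layers)]

schedule = attention_schedule(16)
print(schedule.count("softmax"), "softmax layers out of", len(schedule))  # 2 out of 16

# MoE sparsity: only a fraction of parameters fire per token.
print(f"{45.9 / 456:.1%} of parameters activated per token")              # ~10.1%
```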
Post
We've just released smolagents v1.3.0, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards!
This interactive format is, IMO, much easier for inspecting big multi-step runs than endless console logs.
The setup is very easy: just a few lines of code.
Find a tutorial here: https://huggingface.co./docs/smolagents/tutorials/inspect_runs
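If you've never touched OpenTelemetry, here is the rough shape of the moving parts, using the plain OpenTelemetry SDK with a manual span; the actual smolagents integration in the tutorial is a ready-made instrumentor plus an exporter, so treat this only as orientation.

```python
# Plain-OpenTelemetry orientation sketch: a tracer provider, an exporter, and a
# span around the agent run. The linked tutorial replaces the manual span with
# a one-line smolagents instrumentor; the agent below is assumed to exist.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for an OTLP exporter in practice
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("smolagents-demo")
with tracer.start_as_current_span("agent.run") as span:
    span.set_attribute("task", "What's the weather in Paris?")
    # result = agent.run("What's the weather in Paris?")  # hypothetical agent built elsewhere
```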
Post
Hey there folks, Open LLM Europe just released the Lucie 7B-Instruct model, a bilingual instruct model trained on open data! You can check out my unofficial demo here while we wait for the official inference API from the group:
Tonic/Lucie-7B. Hope you like it!
Post
...And we're live! Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co./blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit, @giadap, and @sasha
Post
Published a new blog post!
In this blog post I go through the transformer architecture, emphasizing how tensor shapes propagate through each layer.
https://huggingface.co./blog/not-lain/tensor-dims
Some interesting takeaways:
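If you want to see the shape story for yourself before reading the full post, here is a toy sketch of mine, not the blog's code, tracing a batch through embedding, attention, and MLP.

```python
# Toy sketch: how tensor shapes propagate through a transformer block.
import torch
import torch.nn as nn

batch, seq, d_model, n_heads = 2, 16, 64, 4
tokens = torch.randint(0, 1000, (batch, seq))                 # (batch, seq)

embed = nn.Embedding(1000, d_model)
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

x = embed(tokens)                                             # (batch, seq, d_model)
attn_out, attn_weights = attn(x, x, x)                        # (batch, seq, d_model), (batch, seq, seq)
y = mlp(attn_out)                                             # (batch, seq, d_model): shape preserved layer to layer
print(tokens.shape, x.shape, attn_out.shape, attn_weights.shape, y.shape)
```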
Post
OS-Genesis: new research paper proposes a novel training data generation method for Claude-Computer-Use-like agents, with impressive results!
The main bottleneck in building GUI agents is finding training data.
GUI agent trajectories are not easy to come by. Crowdsourcing trajectories, then manually annotating them, could be an option, but at scale it's hard to do.
You could use synthetic data generation (ask thousands of small existing GUI agents to solve tasks and keep only the successful runs). But then it's hard to come up with many high-level tasks.
Well, a novel technique was just published that opens a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a novel way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions (a conceptual sketch in code follows at the end of this post).
Exploration-driven vs task-driven approach:
- Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
- It then reverse-engineers high-level tasks from successful interaction patterns
- This leads to more natural and diverse training data than predefined tasks
Novel reward model for trajectory quality:
- Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
- This preserves valuable partial successes that would otherwise be wasted
Superior results across environments:
- Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8x A100 GPUs!
Read the paper here: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2412.19723)
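And here is the promised conceptual sketch of the loop described above. It's my own toy pseudocode, not the authors' implementation; explore_gui, derive_task, and score_trajectory are hypothetical stand-ins for the exploration policy, the task reverse-engineering step, and the trajectory reward model.

```python
# Conceptual sketch of reverse task synthesis (not the OS-Genesis code).
# The three helpers are hypothetical stand-ins, stubbed so the sketch runs.
import random

def explore_gui(env):                         # stand-in: free-form clicking / scrolling / typing
    return [{"action": random.choice(["click", "scroll", "type"]), "target": "some_widget"} for _ in range(5)]

def derive_task(interactions):                # stand-in: an LLM summarizes what the interactions accomplished
    return f"Complete a workflow that starts with a {interactions[0]['action']}"

def score_trajectory(task, interactions):     # stand-in: reward model scoring coherence + completion
    return random.random()

def build_training_data(env=None, n_rollouts=100, min_score=0.5):
    dataset = []
    for _ in range(n_rollouts):
        interactions = explore_gui(env)                   # 1. explore the GUI first
        task = derive_task(interactions)                  # 2. reverse-engineer the high-level task afterwards
        score = score_trajectory(task, interactions)      # 3. keep partial successes, weighted by their score
        if score >= min_score:
            dataset.append({"task": task, "trajectory": interactions, "score": score})
    return dataset

print(len(build_training_data()))
```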
Post
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars!
But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
We will make it work better, and fully open.
Sounds like something you'd like to do? Apply here: https://apply.workable.com/huggingface/j/AF1D4E3FEB/
jeffboudier posted an update 18 days ago
Post
NVIDIA just announced the Cosmos World Foundation Models, available on the Hub:
nvidia/cosmos-6751e884dc10e013a0a0d8e6
Cosmos is a family of pre-trained models purpose-built for generating physics-aware videos and world states to advance physical AI development.
The release includes tokenizers: nvidia/cosmos-tokenizer-672b93023add81b66a8ff8e6
Learn more in this great community article by @mingyuliutw and @PranjaliJoshi https://huggingface.co./blog/mingyuliutw/nvidia-cosmos
Post
Cool to see @ylecun joining the top 10 of most followed on HF!
(and leaderboard by @mvaloatto is here: mvaloatto/TCTF)
Post
After 6 years, BERT, the workhorse of encoder models, finally gets a replacement: welcome ModernBERT!
We talk a lot about Generative AI, meaning the decoder version of the Transformer architecture, but this is only one of the ways to build LLMs: encoder models, which turn a sentence into a vector, are maybe even more widely used in industry than generative models.
The workhorse for this category has been BERT since its release in 2018 (that's prehistory for LLMs).
It's not a fancy 100B-parameter supermodel (just a few hundred million parameters), but it's an excellent workhorse, kind of the Honda Civic of LLMs.
Many applications use BERT-family models - the top models in this category accumulate millions of downloads on the Hub.
Now a collaboration between Answer.AI and LightOn just introduced BERT's replacement: ModernBERT.
TL;DR:
Architecture changes:
First, standard modernizations:
- Rotary positional embeddings (RoPE)
- Replace GeLU with GeGLU
- Use Flash Attention 2
The team also introduced innovative techniques like alternating attention instead of full attention, and sequence packing to get rid of padding overhead.
As a result, the model tops the encoder-model game:
It beats the previous standard, DeBERTaV3, with 1/5th the memory footprint, and runs 4x faster!
Read the blog post: https://huggingface.co./blog/modernbert
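If you just want to poke at it, a fill-mask call through transformers should look roughly like this, assuming the answerdotai/ModernBERT-base checkpoint from the blog post and a transformers version recent enough to ship the architecture.

```python
# Quick try-out sketch: masked-token prediction with ModernBERT via transformers.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill_mask("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```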
Post
Hugging Face releases Picotron, a microscopic lib that solves LLM training 4D parallelization!
Llama-3.1-405B took 39 million GPU-hours to train, i.e. about 4.5 thousand years of single-GPU time.
If they had needed all that time, we would have GPU stories from the time of the Pharaohs: "Alas, Lord of Two Lands, the shipment of counting-stones arriving from Cathay was lost to pirates; this shall delay the building of your computing temple by many moons."
But instead, they just parallelized the training across 24k H100s, which made it take just a few months.
This required parallelizing across 4 dimensions: data, tensor, context, pipeline.
And it is infamously hard to do, making for bloated code repos that hold together only by magic.
But now we don't need huge repos anymore! Instead of building mega training codebases, Hugging Face colleagues cooked in the other direction, towards tiny 4D-parallelism libs. A team has built Nanotron, already widely used in industry.
And now a team has released Picotron, a radical approach that codes 4D parallelism in just a few hundred lines, a real feat of engineering that makes it much easier to understand what's actually happening!
It's tiny, yet powerful:
Measured in MFU (Model FLOPs Utilization: how much of the available compute the model actually uses), this lib reaches ~50% on the SmolLM-1.7B model with 8 H100 GPUs, which is really close to what huge libs reach. (Caution: the team is running further benchmarks to verify this.)
Go take a look: https://github.com/huggingface/picotron/tree/main/picotron
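To make the numbers above concrete, here is the back-of-the-envelope arithmetic; illustrative only, and the 4D split at the end is a made-up example, not Meta's or Picotron's actual configuration.

```python
# Back-of-the-envelope arithmetic for the numbers above (illustrative only).
gpu_hours = 39e6
print(gpu_hours / 24 / 365)        # ~4452 -> about 4.5 thousand years on a single GPU

n_gpus = 24_000
print(gpu_hours / n_gpus / 24)     # ~68 days of ideal GPU time on 24k H100s; real wall-clock is longer

# 4D parallelism factorizes the GPU count: data x tensor x context x pipeline.
# This particular split is a made-up example, not an actual training config.
dp, tp, cp, pp = 375, 8, 1, 8
assert dp * tp * cp * pp == n_gpus
```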