Xargs Lynx

xargs01

AI & ML interests

None yet

Recent Activity

liked a Space 15 days ago
gokaygokay/Flux-TRELLIS
liked a Space 15 days ago
Wuvin/Unique3D
liked a Space 15 days ago
Kwai-Kolors/Kolors-Portrait-with-Flux

Organizations

None yet

xargs01's activity

reacted to m-ric's post with 👀 23 days ago
MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length 💥

This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.

Key insights:

🏗️ MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers with traditional softmax attention every 8 layers
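
To make the hybrid layout concrete, here is a minimal sketch of the layer schedule. It assumes the softmax layer closes each group of 8, and `layer_schedule` and the 16-layer depth are hypothetical names and values chosen for illustration, not taken from the model's code.

```python
def layer_schedule(num_layers: int, softmax_every: int = 8) -> list[str]:
    """Attention type per layer: softmax on every `softmax_every`-th layer, lightning elsewhere."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_layers)
    ]


# num_layers=16 is an illustrative value, not the real model depth.
schedule = layer_schedule(16)
softmax_layers = [i for i, kind in enumerate(schedule) if kind == "softmax"]

# Share of parameters active per token, from the figures quoted in the post.
active_fraction = 45.9 / 456

print(softmax_layers)            # zero-based indices of the softmax layers
print(f"{active_fraction:.1%}")  # roughly a tenth of the weights per token
```

With this layout, 7 of every 8 layers scale linearly with sequence length, and only about a tenth of the parameters are used for any given token.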

๐Ÿ† Outperforms leading models across benchmarks while offering vastly longer context:
โ€ฃ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
โ€ฃ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)

🔬 Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high: utilization is generally around 50%)

🎯 Thorough training strategy:
‣ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!

Overall, not only is the model impressive, but the technical paper is also really interesting! 📝
It has lots of insights, including a great comparison showing how a MoE with 2B activated parameters (24B total) far outperforms a 7B dense model for the same amount of FLOPs.

Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here, allows commercial use <100M monthly users 👉 MiniMaxAI/MiniMax-Text-01
reacted to m-ric's post with 👍 about 2 months ago
After 6 years, BERT, the workhorse of encoder models, finally gets a replacement: Welcome ModernBERT! 🤗

We talk a lot about ✨Generative AI✨, meaning "Decoder version of the Transformers architecture", but this is only one of the ways to build LLMs: encoder models, which turn a sentence into a vector, are maybe even more widely used in industry than generative models.

The workhorse for this category has been BERT since its release in 2018 (that's prehistory for LLMs).

It's not a fancy 100B-parameter supermodel (just a few hundred million parameters), but it's an excellent workhorse, kind of a Honda Civic for LLMs.

Many applications use BERT-family models: the top models in this category have accumulated millions of downloads on the Hub.

➡️ Now a collaboration between Answer.AI and LightOn just introduced BERT's replacement: ModernBERT.

TL;DR:
🏛️ Architecture changes:
⇒ First, standard modernizations:
- Rotary positional embeddings (RoPE)
- Replace GeLU with GeGLU
- Use Flash Attention 2
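
For readers who haven't met GeGLU: it gates one projection of the input with the GELU of another. Below is a minimal pure-Python sketch of just the gating step. The learned projections that would produce `gate` and `value` are left out, and the function names are illustrative, not ModernBERT's code.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(gate: list[float], value: list[float]) -> list[float]:
    """Element-wise GELU(gate) * value.

    In a real layer, `gate` and `value` would be two separate linear
    projections of the same input; here they are passed in directly.
    """
    return [gelu(g) * v for g, v in zip(gate, value)]


out = geglu([0.0, 1.0], [2.0, 3.0])
print(out)
```

A gate of 0 zeroes its output entirely, while larger positive gates let more of the value branch through.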
✨ The team also introduced innovative techniques like alternating attention instead of full attention, and sequence packing to get rid of padding overhead.
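
To see why packing removes padding overhead, here is a small sketch that compares a padded batch with a packed one. It assumes the simplest form of packing (concatenating sequences into one flat buffer and recording their boundaries); `pack`, `padded_slots` and the token IDs are hypothetical, and ModernBERT's actual implementation may differ.

```python
def padded_slots(seqs: list[list[int]]) -> int:
    """Token slots used when every sequence is padded to the longest one."""
    return max(len(s) for s in seqs) * len(seqs)


def pack(seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate sequences into one flat buffer and record cumulative boundaries."""
    flat: list[int] = []
    boundaries = [0]
    for s in seqs:
        flat.extend(s)
        boundaries.append(len(flat))
    return flat, boundaries


# Made-up token IDs for three sequences of lengths 3, 1 and 5.
seqs = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
flat, boundaries = pack(seqs)

print(padded_slots(seqs), len(flat))  # slots with padding vs. slots when packed
print(boundaries)                     # where each sequence starts and ends in `flat`
```

The boundaries let attention stay within each original sequence even though the tokens share one buffer.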

🥇 As a result, the model tops the game of encoder models:
It beats the previous standard, DeBERTaV3, with 1/5th the memory footprint, and runs 4x faster!

Read the blog post 👉 https://huggingface.co./blog/modernbert