After 6 years, BERT, the workhorse of encoder models, finally gets a replacement: Welcome ModernBERT!
We talk a lot about ✨Generative AI✨, meaning "Decoder version of the Transformers architecture", but this is only one of the ways to build LLMs: encoder models, which turn a sentence into a vector, are maybe even more widely used in industry than generative models.
The workhorse for this category has been BERT since its release in 2018 (that's prehistory for LLMs).
It's not a fancy 100B-parameter supermodel (just a few hundred million), but it's an excellent workhorse, kind of a Honda Civic for LLMs.
Many applications use BERT-family models: the top models in this category rack up millions of downloads on the Hub.
➡️ Now a collaboration between Answer.AI and LightOn just introduced BERT's replacement: ModernBERT.
TL;DR:
Architecture changes:
First, standard modernizations:
- Rotary positional embeddings (RoPE)
- Replace GELU with GeGLU
- Use Flash Attention 2
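To make the first two items more concrete, here is a minimal pure-Python sketch of the ideas behind RoPE and GeGLU. It is an illustration under assumptions, not ModernBERT's actual code: the function names (`rope`, `geglu`), the `base=10000.0` default and the toy vectors are all made up for this example, and a real layer would compute the gate and value with learned linear projections.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`.

    `base` is an illustrative default, not a value taken from the post.
    """
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = position * base ** (-2 * i / dim)  # one frequency per pair
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gelu(x):
    # Exact GELU written with the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def geglu(gate, value):
    """GeGLU: GELU(gate) multiplied elementwise by value.

    In a real feed-forward layer, `gate` and `value` would be two
    linear projections of the same input; here they are passed in directly.
    """
    return [gelu(g) * v for g, v in zip(gate, value)]

# Hypothetical toy query/key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset between query and key (2 positions), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

The two scores match up to floating-point error, which is the point of rotary embeddings: the attention score depends on how far apart two tokens are, not on where they sit in the sequence.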
✨ The team also introduced innovative techniques like alternating attention (mixing local and global attention layers instead of using full attention everywhere), and sequence packing to get rid of padding overhead.
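A rough sketch of both ideas, in plain Python. Everything here is an assumption made for illustration: the helper names (`attention_mask`, `pack`), the tiny `window` and `global_every` values, and the greedy packing strategy are not taken from the ModernBERT code, so check the blog post for the real settings.

```python
def attention_mask(seq_len, layer, window=2, global_every=3):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed scheme: every `global_every`-th layer sees the full sequence,
    the other layers only see tokens within `window` positions on each side.
    """
    if layer % global_every == 0:
        return [[True] * seq_len for _ in range(seq_len)]
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def pack(sequences, max_len):
    """Greedily concatenate token lists into rows of at most `max_len` tokens.

    Returns (rows, boundaries). boundaries[r] holds the cumulative offsets
    of the original sequences inside row r, so attention can later be
    restricted to each original sequence.
    """
    rows, boundaries = [], []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError("sequence longer than max_len")
        if rows and len(rows[-1]) + len(seq) <= max_len:
            # Fits in the current row: append and record where it ends.
            rows[-1].extend(seq)
            boundaries[-1].append(len(rows[-1]))
        else:
            # Start a new row.
            rows.append(list(seq))
            boundaries.append([0, len(seq)])
    return rows, boundaries

# Hypothetical batch of four tokenized sentences of different lengths.
batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
rows, boundaries = pack(batch, max_len=5)

# Padding every sentence to the longest one would use 4 * 4 = 16 slots;
# packing uses only the 10 real tokens.
padded_slots = len(batch) * max(len(s) for s in batch)
packed_slots = sum(len(r) for r in rows)
```

With this toy batch, the four sentences end up in two full rows and no slot is spent on padding tokens. The local layers do less work because each token only looks at its neighbours, while the occasional global layer keeps long-range context.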
As a result, the model tops the game of encoder models:
It beats the previous standard, DeBERTaV3, with 1/5th the memory footprint, and runs 4x faster!
Read the blog post: https://huggingface.co/blog/modernbert