Exciting breakthrough in Text Embeddings: Introducing LENS (Lexicon-based EmbeddiNgS)!

A team of researchers from the University of Amsterdam, the University of Technology Sydney, and Tencent has developed a groundbreaking approach that outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB).

>> Key Technical Innovations:
- LENS consolidates the vocabulary space by clustering token embeddings, addressing the inherent redundancy in LLM tokenizers
- Applies bidirectional attention and tailored pooling strategies to turn decoder-only LLMs into effective embedding models
- Each dimension corresponds to a token cluster instead of an individual token, creating more coherent and compact embeddings (see the sketch after this list)
- Achieves competitive performance with just 4,000-8,000-dimensional embeddings, matching the size of dense counterparts
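
To make the cluster-per-dimension idea concrete, here is a minimal sketch of how such an embedding could be computed (the mean pooling and log-ReLU activation are illustrative assumptions, not necessarily the paper's exact recipe): pool the LLM's hidden states, then score the pooled vector against the centroid matrix standing in for the LM head, so each output dimension rates one token cluster.

```python
# Minimal sketch: lexicon-style embedding whose dimensions are token clusters.
# Pooling choice and activation are illustrative assumptions.
import torch

def lens_style_embedding(hidden_states: torch.Tensor,
                         attention_mask: torch.Tensor,
                         centroid_head: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq, d) from an LLM run with bidirectional attention
    # attention_mask: (batch, seq), 1 for real tokens, 0 for padding
    # centroid_head: (n_clusters, d) centroids replacing the LM head rows
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    logits = pooled @ centroid_head.T            # one score per token cluster
    return torch.log1p(torch.relu(logits))       # sparse, non-negative weights
```

With 4,000-8,000 clusters this yields a vector in exactly the dimensionality range quoted above.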

>> Under the Hood:
The framework applies KMeans clustering to token embeddings from the language modeling head, replacing original embeddings with cluster centroids. This reduces dimensionality while preserving semantic relationships.
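
A rough sketch of that step, assuming a Hugging Face causal LM and scikit-learn (the gpt2 checkpoint and the 4,000-cluster count are placeholder choices, not the paper's):

```python
# Sketch of the vocabulary-consolidation step: cluster the LM-head token
# embeddings with KMeans and map every token to its cluster centroid.
# Model checkpoint and cluster count below are placeholders.
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")            # placeholder LLM
token_emb = model.get_output_embeddings().weight                # (vocab_size, d)
token_emb = token_emb.detach().float().numpy()

# MiniBatchKMeans keeps clustering tractable for large vocabularies.
kmeans = MiniBatchKMeans(n_clusters=4000, batch_size=4096,
                         random_state=0).fit(token_emb)
centroids = torch.tensor(kmeans.cluster_centers_)               # (4000, d)
labels = torch.tensor(kmeans.labels_, dtype=torch.long)         # (vocab_size,)

# Replace each token's embedding with its cluster centroid: the effective
# output vocabulary now has 4,000 cluster-level entries instead of ~50k tokens.
clustered_head = centroids[labels]                              # (vocab_size, d)
```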

>> Results:
- Outperforms dense embeddings on the MTEB benchmark
- Achieves state-of-the-art performance on BEIR retrieval tasks when combined with dense embeddings (see the fusion sketch below)
- Demonstrates superior performance across clustering, classification, and retrieval tasks
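
For the dense-plus-lexical combination, one simple fusion scheme (an assumption for illustration; the paper's exact combination may differ) is to interpolate cosine similarities from the two embedding types:

```python
# Simple score-level fusion of lexicon-based and dense similarities.
# The interpolation weight alpha is an illustrative choice.
import torch
import torch.nn.functional as F

def hybrid_score(q_lex, d_lex, q_dense, d_dense, alpha: float = 0.5):
    s_lex = F.cosine_similarity(q_lex, d_lex, dim=-1)
    s_dense = F.cosine_similarity(q_dense, d_dense, dim=-1)
    return alpha * s_lex + (1 - alpha) * s_dense
```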

This work opens new possibilities for more efficient and interpretable text embeddings. The code will be available soon.