|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- answerdotai/ModernBERT-base |
|
pipeline_tag: sentence-similarity |
|
library_name: transformers |
|
tags: |
|
- sentence-transformers |
|
--- |
|
|
|
# gte-modernbert-base |
|
|
|
We are excited to introduce the `gte-modernbert` series of models, built upon the latest ModernBERT pre-trained encoder-only foundation model. The series includes both text embedding models and text reranking models.
|
|
|
The `gte-modernbert` models demonstrate competitive performance on several text embedding and text retrieval benchmarks, including MTEB, LoCo, and CoIR, when compared with similar-scale models from the open-source community.
|
|
|
## Model Overview |
|
|
|
- Developed by: Tongyi Lab, Alibaba Group |
|
- Model Type: Text Embedding |
|
- Primary Language: English |
|
- Model Size: 149M |
|
- Max Input Length: 8192 tokens |
|
- Output Dimension: 768 |
|
|
|
### Model List
|
|
|
|
|
| Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR | |
|
|:--------------------------------------------------------------------------------------:|:--------:|:----------------------:|:----------:|:---------------:|:---------:|:-------:|:----:|:----:|:----:| |
|
| [`gte-modernbert-base`](https://huggingface.co./Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 64.38 | 55.33 | 87.57 | 79.31 | |
|
| [`gte-reranker-modernbert-base`](https://huggingface.co./Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.99 | |
|
|
|
## Usage |
|
|
|
> [!TIP] |
|
> For `transformers` and `sentence-transformers`, the efficient Flash Attention 2 implementation is used automatically if your GPU supports it and the `flash_attn` package is installed. Installing it is optional:
|
> |
|
> ```bash |
|
> pip install flash_attn |
|
> ``` |
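
If you prefer to control this explicitly, `transformers` also lets you pick the attention implementation at load time via the standard `attn_implementation` argument; a minimal sketch (not required for normal use):

```python
import torch
from transformers import AutoModel

# Explicitly request Flash Attention 2 (needs a supported GPU,
# half-precision weights, and the flash_attn package).
model = AutoModel.from_pretrained(
    "Alibaba-NLP/gte-modernbert-base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# Or stick with the default PyTorch SDPA implementation:
# model = AutoModel.from_pretrained(
#     "Alibaba-NLP/gte-modernbert-base", attn_implementation="sdpa"
# )
```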
|
|
|
Use with `transformers`:
|
|
|
```python |
|
# Requires transformers>=4.48.0 |
|
|
|
import torch.nn.functional as F |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
input_texts = [ |
|
"what is the capital of China?", |
|
"how to implement quick sort in python?", |
|
"Beijing", |
|
"sorting algorithms" |
|
] |
|
|
|
model_path = "Alibaba-NLP/gte-modernbert-base" |
|
tokenizer = AutoTokenizer.from_pretrained(model_path) |
|
model = AutoModel.from_pretrained(model_path) |
|
|
|
# Tokenize the input texts |
|
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt') |
|
|
|
outputs = model(**batch_dict) |
|
# CLS pooling: use the hidden state of the first token as the text embedding
embeddings = outputs.last_hidden_state[:, 0]
|
|
|
# (Optionally) normalize embeddings |
|
embeddings = F.normalize(embeddings, p=2, dim=1) |
|
scores = (embeddings[:1] @ embeddings[1:].T) * 100 |
|
print(scores.tolist()) |
|
# [[42.89073944091797, 71.30911254882812, 33.664554595947266]] |
|
``` |
|
|
|
Use with `sentence-transformers`: |
|
|
|
```python |
|
# Requires transformers>=4.48.0 |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.util import cos_sim |
|
|
|
input_texts = [ |
|
"what is the capital of China?", |
|
"how to implement quick sort in python?", |
|
"Beijing", |
|
"sorting algorithms" |
|
] |
|
|
|
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base") |
|
embeddings = model.encode(input_texts) |
|
print(embeddings.shape) |
|
# (4, 768) |
|
|
|
similarities = cos_sim(embeddings[0], embeddings[1:]) |
|
print(similarities) |
|
# tensor([[0.4289, 0.7131, 0.3366]]) |
|
``` |
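
The same embeddings also plug into the retrieval utilities shipped with `sentence-transformers`. A minimal sketch of top-k retrieval using `sentence_transformers.util.semantic_search` (the small corpus below is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

corpus = ["Beijing", "sorting algorithms", "quick sort implementation in Python"]
queries = ["what is the capital of China?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# Return the top-2 corpus entries for each query, ranked by cosine similarity
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))
```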
|
|
|
Use with `transformers.js`: |
|
|
|
```js |
|
// npm i @xenova/transformers |
|
import { pipeline, dot } from '@xenova/transformers'; |
|
|
|
// Create feature extraction pipeline |
|
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', { |
|
quantized: false, // Comment out this line to use the quantized version |
|
}); |
|
|
|
// Generate sentence embeddings |
|
const sentences = [ |
|
"what is the capital of China?", |
|
"how to implement quick sort in python?", |
|
"Beijing", |
|
"sorting algorithms" |
|
] |
|
const output = await extractor(sentences, { normalize: true, pooling: 'cls' }); |
|
|
|
// Compute similarity scores |
|
const [source_embeddings, ...document_embeddings ] = output.tolist(); |
|
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x)); |
|
console.log(similarities); |
|
``` |
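
The companion `gte-reranker-modernbert-base` model listed above scores query-document pairs directly instead of producing embeddings. A minimal sketch, assuming it exposes the standard `AutoModelForSequenceClassification` cross-encoder interface (see its own model card for the authoritative usage):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "Alibaba-NLP/gte-reranker-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

queries = ["what is the capital of China?", "what is the capital of China?"]
documents = ["Beijing", "sorting algorithms"]

# Score each (query, document) pair; higher logits indicate higher relevance
with torch.no_grad():
    inputs = tokenizer(queries, documents, padding=True, truncation=True,
                       return_tensors="pt", max_length=8192)
    scores = model(**inputs).logits.view(-1).float()

print(scores.tolist())
```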
|
|
|
## Training Details |
|
|
|
The `gte-modernbert` series follows the training recipe of the previous [GTE models](https://huggingface.co./collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469); the only difference is that the pre-trained language model backbone has been changed from [GTE-MLM](https://huggingface.co./Alibaba-NLP/gte-en-mlm-base) to [ModernBERT](https://huggingface.co./answerdotai/ModernBERT-base). For more training details, please refer to our paper: [mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval](https://aclanthology.org/2024.emnlp-industry.103/).
|
|
|
## Evaluation |
|
|
|
### MTEB |
|
|
|
The results of the other models are taken from the [MTEB leaderboard](https://huggingface.co./spaces/mteb/leaderboard). Since all models in the `gte-modernbert` series have fewer than 1B parameters, we compare only against leaderboard models under 1B parameters.
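
As a reference point, scores like these can be reproduced with the [`mteb`](https://github.com/embeddings-benchmark/mteb) package; the sketch below evaluates on two illustrative English tasks rather than the full benchmark:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# Two small English tasks chosen for illustration; the full MTEB-en run covers 56 tasks
tasks = mteb.get_tasks(tasks=["Banking77Classification", "STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/gte-modernbert-base")
print(results)
```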
|
|
|
| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) | |
|
|:------------------------------------------------------------------------------------------------:|:--------------:|:---------:|:---------------:|:------------:|:-----------:|:---:|:---:|:---:|:---:|:-----------:|:--------:| |
|
| [mxbai-embed-large-v1](https://huggingface.co./mixedbread-ai/mxbai-embed-large-v1) | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 | |
|
| [multilingual-e5-large-instruct](https://huggingface.co./intfloat/multilingual-e5-large-instruct) | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 | |
|
| [bge-large-en-v1.5](https://huggingface.co./BAAI/bge-large-en-v1.5) | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 | |
|
| [gte-base-en-v1.5](https://huggingface.co./Alibaba-NLP/gte-base-en-v1.5) | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 | |
|
| [bge-base-en-v1.5](https://huggingface.co./BAAI/bge-base-en-v1.5) | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 | |
|
| [gte-large-en-v1.5](https://huggingface.co./Alibaba-NLP/gte-large-en-v1.5) | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 | |
|
| [modernbert-embed-base](https://huggingface.co./nomic-ai/modernbert-embed-base) | 149 | 768 | 8192 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 | |
|
| [nomic-embed-text-v1.5](https://huggingface.co./nomic-ai/nomic-embed-text-v1.5) | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01| 81.94 | 30.4 | |
|
| [gte-multilingual-base](https://huggingface.co./Alibaba-NLP/gte-multilingual-base) | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 |51.08 | 82.11 | 30.58 | |
|
| [jina-embeddings-v3](https://huggingface.co./jinaai/jina-embeddings-v3) | 572 | 1024 | 8192 | 65.51 | 82.58 |45.21 |84.01 |58.13 |53.88 | 85.81 | 29.71 | |
|
| [**gte-modernbert-base**](https://huggingface.co./Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | **64.38** | **76.99** | **46.47** | **85.93** | **59.24** | **55.33** | **81.57** | **30.68** |
|
|
|
|
|
### LoCo (Long Document Retrieval) (NDCG@10)
|
|
|
| Model Name | Dimension | Sequence Length | Average (5) | QMSumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
|
| [gte-qwen1.5-7b](https://huggingface.co./Alibaba-NLP/gte-qwen1.5-7b) | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 | |
|
| [gte-large-v1.5](https://huggingface.co./Alibaba-NLP/gte-large-v1.5) |1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 | |
|
| [gte-base-v1.5](https://huggingface.co./Alibaba-NLP/gte-base-v1.5) | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 | |
|
| [gte-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 88.88 | 54.45 | 93.00 | 99.82 | 98.03 | 98.70 | |
|
| [gte-reranker-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 90.68 | 70.86 | 94.06 | 99.73 | 99.11 | 89.67 | |
|
|
|
### CoIR (Code Retrieval Task) (NDCG@10)
|
|
|
| Model Name | Dimension | Sequence Length | Average(20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql | |
|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
| [gte-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 | 35.46 | 43.47 | 91.2 | 61.87 | |
|
| [gte-reranker-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.99 | 96.43 | 96.88 | 98.32 | 91.81 | 97.7 | 91.96 | 88.81 | 79.71 | 76.27 | 89.39 | 98.37 | 84.11 | 47.57 | 83.37 | 88.91 | 49.66 | 36.36 | 44.37 | 89.58 | 64.21 | |
|
|
|
### BEIR (NDCG@10)
|
|
|
| Model Name | Dimension | Sequence Length | Average(15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID | |
|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
| [gte-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 | |
|
| [gte-reranker-modernbert-base](https://huggingface.co./Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 56.73 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 | |
|
|
|
|
|
|
|
## Hiring |
|
|
|
We have open positions for **Research Interns** and **Full-Time Researchers** to join our team at Tongyi Lab. |
|
We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, Retrieval-Augmented Generation (RAG), and agent-based systems. |
|
Our team is located in the vibrant cities of **Beijing** and **Hangzhou**. |
|
If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to <a href="mailto:[email protected]">[email protected]</a>. |
|
|
|
|
|
## Citation |
|
|
|
If you find our paper or models helpful, please consider citing us:
|
|
|
```bibtex
|
@inproceedings{zhang2024mgte, |
|
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval}, |
|
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others}, |
|
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track}, |
|
pages={1393--1412}, |
|
year={2024} |
|
} |
|
|
|
@article{li2023towards, |
|
title={Towards general text embeddings with multi-stage contrastive learning}, |
|
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan}, |
|
journal={arXiv preprint arXiv:2308.03281}, |
|
year={2023} |
|
} |
|
``` |