metadata
license: apache-2.0
language:
  - en
base_model:
  - answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
library_name: transformers
tags:
  - sentence-transformers
  - mteb
  - embedding

gte-modernbert-base

We are excited to introduce the gte-modernbert series of models, which are built upon the latest ModernBERT pre-trained encoder-only foundation models. The series includes both text embedding models and reranking models.

The gte-modernbert models demonstrate competitive performance on several text embedding and text retrieval benchmarks, including MTEB, LoCo, and CoIR, when compared to similar-scale models from the current open-source community.

Model Overview

  • Developed by: Tongyi Lab, Alibaba Group
  • Model Type: Text Embedding
  • Primary Language: English
  • Model Size: 149M
  • Max Input Length: 8192 tokens
  • Output Dimension: 768

Model list

| Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
|---|---|---|---|---|---|---|---|---|---|
| gte-modernbert-base | English | text embedding | 149M | 8192 | 768 | 64.38 | 55.33 | 87.57 | 79.31 |
| gte-reranker-modernbert-base | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.99 |

Usage

If flash_attn is installed and your GPU supports it, both transformers and sentence-transformers use the efficient Flash Attention 2 automatically. Installing it is optional.

pip install flash_attn
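
If you would rather select the attention backend explicitly than rely on automatic detection, one option is the `attn_implementation` argument of `from_pretrained`. The following is a minimal sketch, not part of the official usage: the helper names `pick_attn_implementation` and `load_model` are hypothetical, and falling back to `sdpa` with half-precision weights for Flash Attention 2 is an assumption about a reasonable setup.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Choose an attention backend name for `from_pretrained`.

    Hypothetical helper: it only checks that flash_attn is importable and
    that a CUDA device is present, not that the GPU generation is supported.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"  # assumed fallback: PyTorch scaled-dot-product attention


def load_model(model_path: str = "Alibaba-NLP/gte-modernbert-base"):
    """Load the encoder with an explicitly selected attention backend."""
    # Imported here so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModel

    attn = pick_attn_implementation(torch.cuda.is_available())
    kwargs = {"attn_implementation": attn}
    if attn == "flash_attention_2":
        # Flash Attention 2 kernels require half-precision weights on a GPU.
        kwargs["torch_dtype"] = torch.float16
    model = AutoModel.from_pretrained(model_path, **kwargs)
    return model.to("cuda") if attn == "flash_attention_2" else model
```

Calling `load_model()` downloads the checkpoint, so it is kept separate from the backend check, which has no side effects.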

Use with transformers

# Requires transformers>=4.48.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = "Alibaba-NLP/gte-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

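# Run the encoder and take the first ([CLS]) token as the sentence embedding
outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the first text (the query) against the remaining texts, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100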
print(scores.tolist())
# [[42.89073944091797, 71.30911254882812, 33.664554595947266]]
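
The scoring step above works because the dot product of L2-normalized vectors equals their cosine similarity. The sketch below illustrates this with plain Python and made-up 3-dimensional vectors (real embeddings from this model have 768 dimensions); the function names are hypothetical and the factor of 100 simply mirrors the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def scaled_scores(query, docs, scale=100.0):
    """Dot product of normalized vectors, i.e. cosine similarity times `scale`."""
    q = l2_normalize(query)
    return [
        scale * sum(a * b for a, b in zip(q, l2_normalize(doc)))
        for doc in docs
    ]


# Toy vectors (assumptions for illustration, not model outputs):
# the first document is parallel to the query, the second is orthogonal to it.
query = [1.0, 2.0, 2.0]
docs = [[2.0, 4.0, 4.0], [2.0, -1.0, 0.0]]
scores = scaled_scores(query, docs)
print(scores)  # roughly 100 for the parallel vector and 0 for the orthogonal one
```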

Use with sentence-transformers:

# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)
# (4, 768)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.4289, 0.7131, 0.3366]])
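
For retrieval, the similarities are typically used to rank the candidate texts. A minimal sketch, assuming a hypothetical helper `rank_documents` and reusing the rounded similarities printed above (with real outputs you could pass `similarities[0].tolist()` instead):

```python
def rank_documents(scores, documents, top_k=None):
    """Return (document, score) pairs sorted from most to least similar."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Candidate texts and the rounded similarities from the example above.
documents = [
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms",
]
scores = [0.4289, 0.7131, 0.3366]

for text, score in rank_documents(scores, documents, top_k=2):
    print(f"{score:.4f}  {text}")
```

With these scores, "Beijing" ranks first for the query "what is the capital of China?".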

Use with transformers.js:

// npm i @huggingface/transformers
// ModernBERT requires Transformers.js v3 (the @huggingface/transformers package)
import { pipeline, dot } from '@huggingface/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
    dtype: 'fp32', // Full precision; a quantized variant such as 'q8' can be selected here instead
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities);

Training Details

The gte-modernbert series of models follows the training scheme of the previous GTE models; the only difference is that the pre-trained language model base has been changed from GTE-MLM to ModernBERT. For more training details, please refer to our paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Evaluation

MTEB

The results of other models are taken from the MTEB leaderboard. Since all models in the gte-modernbert series have fewer than 1B parameters, we compare only against leaderboard models under 1B parameters.

| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
| gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| modernbert-embed-base | 149 | 768 | 8192 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
| nomic-embed-text-v1.5 | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
| gte-multilingual-base | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 | 51.08 | 82.11 | 30.58 |
| jina-embeddings-v3 | 572 | 1024 | 8192 | 65.51 | 82.58 | 45.21 | 84.01 | 58.13 | 53.88 | 85.81 | 29.71 |
| gte-modernbert-base | 149 | 768 | 8192 | 64.38 | 76.99 | 46.47 | 85.93 | 59.24 | 55.33 | 81.57 | 30.68 |

LoCo (Long Document Retrieval) (NDCG@10)

| Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|---|---|---|---|---|---|---|---|---|
| gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
| gte-modernbert-base | 768 | 8192 | 88.88 | 54.45 | 93.00 | 99.82 | 98.03 | 98.70 |
| gte-reranker-modernbert-base | - | 8192 | 90.68 | 70.86 | 94.06 | 99.73 | 99.11 | 89.67 |
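
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 retrieved documents with that of an ideal ranking. The sketch below is a simplified illustration with hypothetical function names and toy relevance labels; it assumes a linear gain and a log2 rank discount, and the benchmark toolkits may differ in details such as tie handling. The tables report the value multiplied by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: labels of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy query with a single relevant document (label 1) among three candidates.
perfect = ndcg_at_k([1, 0, 0], [1, 0, 0])   # relevant document ranked first
second = ndcg_at_k([0, 1, 0], [1, 0, 0])    # relevant document ranked second
print(perfect, second)
```

Ranking the only relevant document second instead of first lowers the score from 1 to 1 / log2(3), about 0.63.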

CoIR (Code Retrieval Task) (NDCG@10)

| Model Name | Dimension | Sequence Length | Average (20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-modernbert-base | 768 | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 | 35.46 | 43.47 | 91.2 | 61.87 |
| gte-reranker-modernbert-base | - | 8192 | 79.99 | 96.43 | 96.88 | 98.32 | 91.81 | 97.7 | 91.96 | 88.81 | 79.71 | 76.27 | 89.39 | 98.37 | 84.11 | 47.57 | 83.37 | 88.91 | 49.66 | 36.36 | 44.37 | 89.58 | 64.21 |

BEIR (NDCG@10)

| Model Name | Dimension | Sequence Length | Average (15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-modernbert-base | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 |
| gte-reranker-modernbert-base | - | 8192 | 56.73 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |

Hiring

We have open positions for Research Interns and Full-Time Researchers to join our team at Tongyi Lab. We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, Retrieval-Augmented Generation (RAG), and agent-based systems. Our team is located in the vibrant cities of Beijing and Hangzhou. If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to [email protected].

Citation

If you find our paper or models helpful, please consider citing them.

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}