Update README.md

README.md CHANGED

@@ -8,11 +8,11 @@ pipeline_tag: sentence-similarity
library_name: transformers
---

# gte-modernbert-base

We are excited to introduce the `gte-modernbert` series of models, which are built upon the latest ModernBERT pre-trained encoder-only foundation models. The `gte-modernbert` series includes both text embedding models and reranking models.

The `gte-modernbert` models demonstrate competitive performance in several text embedding and text retrieval evaluation tasks when compared to similar-scale models from the current open-source community, including the MTEB, LoCo, and CoIR evaluations.

## Model Overview

@@ -21,66 +21,88 @@
- Primary Language: English
- Model Size: 149M
- Max Input Length: 8192 tokens
- Output Dimension: 768

### Model list

| Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [`gte-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 64.38 | 55.33 | 87.57 | 79.31 |
| [`gte-reranker-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.99 |

## Usage

Use with `Transformers`:

```python
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-modernbert-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
# Take the [CLS] token (first token) as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
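
`Transformers` can also score query-document pairs with the reranker model. A minimal sketch, assuming `gte-reranker-modernbert-base` loads as a standard sequence-classification cross-encoder; check the reranker's own model card for the authoritative usage:

```python
# Hedged sketch: rerank query-document pairs with gte-reranker-modernbert-base.
# Assumes the model exposes a sequence-classification head; verify against the
# reranker's model card before relying on this.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, torch_dtype='auto', trust_remote_code=True
)
model.eval()

pairs = [
    ["what is the capital of China?", "Beijing"],
    ["what is the capital of China?", "sorting algorithms"],
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=8192)
    scores = model(**inputs).logits.view(-1).float()
print(scores.tolist())  # higher score = more relevant pair
```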

Use with `sentence-transformers`:

```python
# Requires sentence_transformers>=2.7.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
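
For reranking through `sentence-transformers`, the `CrossEncoder` wrapper applies. A minimal sketch, reconstructed from the `CrossEncoder` snippet shown in an earlier revision of this README (the `automodel_args` and `trust_remote_code` arguments come from that revision):

```python
# Sketch: reranking with sentence-transformers' CrossEncoder, adapted from
# the CrossEncoder usage in an earlier revision of this README.
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    'Alibaba-NLP/gte-reranker-modernbert-base',
    automodel_args={"torch_dtype": "auto"},
    trust_remote_code=True,
)

pairs = [
    ["what is the capital of China?", "Beijing"],
    ["what is the capital of China?", "sorting algorithms"],
]
scores = model.predict(pairs)  # one relevance score per pair
print(scores)
```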

Use with `transformers.js`:

```js
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities);
```

## Training Details

@@ -105,7 +127,7 @@ The results of other models are retrieved from [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
| [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
| [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 | 51.08 | 82.11 | 30.58 |
| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 572 | 1024 | 8192 | 65.51 | 82.58 | 45.21 | 84.01 | 58.13 | 53.88 | 85.81 | 29.71 |
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | 64.38 | 76.99 | 46.47 | 85.93 | 59.24 | 55.33 | 81.57 | 30.68 |

### LoCo (Long Document Retrieval)

@@ -122,17 +144,17 @@
| Model Name | Dimension | Sequence Length | Average(20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 | 35.46 | 43.47 | 91.2 | 61.87 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.99 | 96.43 | 96.88 | 98.32 | 91.81 | 97.7 | 91.96 | 88.81 | 79.71 | 76.27 | 89.39 | 98.37 | 84.11 | 47.57 | 83.37 | 88.91 | 49.66 | 36.36 | 44.37 | 89.58 | 64.21 |

### BEIR

| Model Name | Dimension | Sequence Length | Average(15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 56.73 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |

## Citation