zyznull committed · Commit f687ac3 · verified · 1 Parent(s): 14f83e1

Update README.md

Files changed (1): README.md (+73 −51)

README.md CHANGED
@@ -8,11 +8,11 @@ pipeline_tag: sentence-similarity
  library_name: transformers
  ---
 
- # gte-reranker-modernbert-base
+ # gte-modernbert-base
 
  We are excited to introduce the `gte-modernbert` series of models, built upon the latest ModernBERT pre-trained encoder-only foundation models. The series includes both text embedding models and reranking models.
 
- The `gte-modernbert` models demonstrate competitive performance on several text embedding and text retrieval benchmarks, such as **MTEB**, **LoCo**, and **CoIR**, when compared to similar-scale models from the open-source community.
+ The `gte-modernbert` models demonstrate competitive performance on several text embedding and text retrieval benchmarks, such as MTEB, LoCo, and CoIR, when compared to similar-scale models from the open-source community.
 
  ## Model Overview
 
@@ -21,66 +21,88 @@ The `gte-modernbert` models demonstrates competitive performance in several text
  - Primary Language: English
  - Model Size: 149M
  - Max Input Length: 8192 tokens
+ - Output Dimension: 768
 
  ### Model list
 
  | Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
- |:------:|:--------:|:----------:|:----------:|:---------------:|:---------:|:-------:|:----:|
- | [`gte-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 64.29 | 55.33 | 87.57 | 77.69 |
- | [`gte-reranker-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.31 |
+ |:------:|:--------:|:----------:|:----------:|:---------------:|:---------:|:-------:|:----:|:----:|:----:|
+ | [`gte-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 67.34 | 55.33 | 87.57 | 79.31 |
+ | [`gte-reranker-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.99 |
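
As a quick sanity check of the figures above, the model size, output dimension, and context length can be read off the model config; a minimal sketch, assuming `transformers` is installed as in the Usage section below:

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained('Alibaba-NLP/gte-modernbert-base')
print(config.hidden_size)              # 768  -> embedding dimension
print(config.max_position_embeddings)  # 8192 -> max input length in tokens

model = AutoModel.from_pretrained('Alibaba-NLP/gte-modernbert-base')
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")  # ~149M parameters
```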
 
  ## Usage
 
  Use with `Transformers`
 
  ```python
- # Requires transformers>=4.48.0
- 
- import torch
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
- 
- model_name_or_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
- tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
- model = AutoModelForSequenceClassification.from_pretrained(
-     model_name_or_path, trust_remote_code=True,
-     torch_dtype=torch.float16
- )
- model.eval()
- 
- # Each pair is (query, candidate); the model outputs one relevance score per pair
- pairs = [
-     ["what is the capital of China?", "Beijing"],
-     ["how to implement quick sort in python?", "Introduction of quick sort"],
-     ["how to implement quick sort in python?", "The weather is nice today"],
- ]
- 
- with torch.no_grad():
-     inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
-     scores = model(**inputs, return_dict=True).logits.view(-1).float()
-     print(scores)
- 
- # tensor([1.2315, 0.5923, 0.3041])
+ # Requires transformers>=4.48.0 (the version that added ModernBERT support)
+ 
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+ 
+ input_texts = [
+     "what is the capital of China?",
+     "how to implement quick sort in python?",
+     "Beijing",
+     "sorting algorithms"
+ ]
+ 
+ model_path = 'Alibaba-NLP/gte-modernbert-base'
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
+ 
+ # Tokenize the input texts
+ batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
+ 
+ # CLS pooling: use the first token's hidden state as the text embedding
+ outputs = model(**batch_dict)
+ embeddings = outputs.last_hidden_state[:, 0]
+ 
+ # (Optionally) normalize embeddings so dot products equal cosine similarities
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ scores = (embeddings[:1] @ embeddings[1:].T) * 100
+ print(scores.tolist())
  ```
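
The embedding model and the reranker are natural companions: the embeddings retrieve candidates cheaply, and the cross-encoder rescores the short list. Below is a minimal retrieve-then-rerank sketch combining the two snippets above; the corpus and the `top_k` value are illustrative, not from the model card:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

query = "how to implement quick sort in python?"
corpus = ["Introduction of quick sort", "The weather is nice today", "Beijing"]

# Stage 1: embed the query and corpus, keep the top-k candidates by cosine similarity
emb_tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-modernbert-base')
emb_model = AutoModel.from_pretrained('Alibaba-NLP/gte-modernbert-base')
with torch.no_grad():
    batch = emb_tokenizer([query] + corpus, padding=True, truncation=True, return_tensors='pt')
    emb = F.normalize(emb_model(**batch).last_hidden_state[:, 0], p=2, dim=1)
top_k = (emb[0] @ emb[1:].T).topk(2).indices.tolist()

# Stage 2: rescore the surviving candidates with the cross-encoder reranker
rr_tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-reranker-modernbert-base')
rr_model = AutoModelForSequenceClassification.from_pretrained('Alibaba-NLP/gte-reranker-modernbert-base')
with torch.no_grad():
    pairs = rr_tokenizer([[query, corpus[i]] for i in top_k],
                         padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = rr_model(**pairs).logits.view(-1).float()

# Print candidates from most to least relevant
for score, i in sorted(zip(scores.tolist(), top_k), reverse=True):
    print(f"{score:.3f}  {corpus[i]}")
```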
 
  Use with `sentence-transformers`:
 
- Before you start, install the sentence-transformers library:
- ```
- pip install sentence-transformers
- ```
 
  ```python
  # Requires sentence_transformers>=2.7.0
- from sentence_transformers import CrossEncoder
- 
- model_name_or_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
- 
- model = CrossEncoder(
-     model_name_or_path,
-     automodel_args={"torch_dtype": "auto"},
-     trust_remote_code=True,
- )
- 
- pairs = [
-     ["what is the capital of China?", "Beijing"],
-     ["how to implement quick sort in python?", "Introduction of quick sort"],
-     ["how to implement quick sort in python?", "The weather is nice today"],
- ]
- 
- scores = model.predict(pairs, convert_to_tensor=True).tolist()
- 
- print("scores:", scores)
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+ 
+ sentences = ['That is a happy person', 'That is a very happy person']
+ 
+ model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)
+ embeddings = model.encode(sentences)
+ print(cos_sim(embeddings[0], embeddings[1]))
  ```
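
For asymmetric query-to-document retrieval, rather than the symmetric similarity shown above, `sentence-transformers` also ships a `semantic_search` utility; a minimal sketch with an illustrative three-document corpus:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)

query_emb = model.encode(["how to implement quick sort in python?"], convert_to_tensor=True)
corpus_emb = model.encode(
    ["Introduction of quick sort", "The weather is nice today", "Beijing"],
    convert_to_tensor=True,
)

# Returns, per query, a ranked list of {'corpus_id': ..., 'score': ...} dicts
hits = semantic_search(query_emb, corpus_emb, top_k=2)
print(hits[0])
```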
 
+ Use with `transformers.js`:
+ 
+ ```js
+ // npm i @xenova/transformers
+ import { pipeline, dot } from '@xenova/transformers';
+ 
+ // Create a feature-extraction pipeline
+ const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
+     quantized: false, // Comment out this line to use the quantized version
+ });
+ 
+ // Generate sentence embeddings
+ const sentences = [
+     "what is the capital of China?",
+     "how to implement quick sort in python?",
+     "Beijing",
+     "sorting algorithms"
+ ];
+ const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
+ 
+ // Compute similarity scores between the first sentence and the rest
+ const [source_embeddings, ...document_embeddings] = output.tolist();
+ const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
+ console.log(similarities);
+ ```
 
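Note that `pooling: 'cls'` mirrors the `last_hidden_state[:, 0]` pooling in the Transformers example, and `normalize: true` mirrors the `F.normalize` step, so the dot-product scores here are directly comparable to the Python outputs.
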
  ## Training Details
@@ -105,7 +127,7 @@ The results of other models are retrieved from [MTEB leaderboard](https://huggin
  | [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
  | [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 | 51.08 | 82.11 | 30.58 |
  | [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 572 | 1024 | 8192 | 65.51 | 82.58 | 45.21 | 84.01 | 58.13 | 53.88 | 85.81 | 29.71 |
- | [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | 64.29 | 76.32 | 45.31 | 86.49 | 58.33 | 55.33 | 83.41 | 29.17 |
+ | [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | 64.38 | 76.99 | 46.47 | 85.93 | 59.24 | 55.33 | 81.57 | 30.68 |
 
  ### LoCo (Long Document Retrieval)
@@ -122,17 +144,17 @@
 
  | Model Name | Dimension | Sequence Length | Average(20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql |
  |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- | [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 77.26 | 95.15 | 94.75 | 96.55 | 91.64 | 95.31 | 90.71 | 86.41 | 79.09 | 97.66 | 80.22 | 42.05 | 55.2 | 84.77 | 52.53 |
- | [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 |
+ | [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 | 35.46 | 43.47 | 91.2 | 61.87 |
+ | [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.99 | 96.43 | 96.88 | 98.32 | 91.81 | 97.7 | 91.96 | 88.81 | 79.71 | 76.27 | 89.39 | 98.37 | 84.11 | 47.57 | 83.37 | 88.91 | 49.66 | 36.36 | 44.37 | 89.58 | 64.21 |
 
  ### BEIR
 
  | Model Name | Dimension | Sequence Length | Average(15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
  | [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 |
- | [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |
+ | [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 56.73 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |
 
  ## Citation