thenlper committed · Commit 14f83e1 · verified · 1 Parent(s): 4f868af

Update README.md

Files changed (1)
  1. README.md +37 -57
README.md CHANGED
@@ -8,7 +8,7 @@ pipeline_tag: sentence-similarity
 library_name: transformers
 ---
 
-# gte-modernbert-base
+# gte-reranker-modernbert-base
 
 We are excited to introduce the `gte-modernbert` series of models, which are built upon the latest modernBERT pre-trained encoder-only foundation models. The `gte-modernbert` series models include both text embedding models and rerank models.
 
@@ -21,7 +21,6 @@ The `gte-modernbert` models demonstrates competitive performance in several text
 - Primary Language: English
 - Model Size: 149M
 - Max Input Length: 8192 tokens
-- Output Dimension: 768
 
 ### Model list
 | Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
@@ -36,71 +35,52 @@ Use with `Transformers`
 ```python
 # Requires transformers>=4.48.0
 
-import torch.nn.functional as F
-from transformers import AutoModel, AutoTokenizer
-
-input_texts = [
-    "what is the capital of China?",
-    "how to implement quick sort in python?",
-    "Beijing",
-    "sorting algorithms"
-]
-
-model_path = 'Alibaba-NLP/gte-modernbert-base'
-tokenizer = AutoTokenizer.from_pretrained(model_path)
-model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
-
-# Tokenize the input texts
-batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
-
-outputs = model(**batch_dict)
-embeddings = outputs.last_hidden_state[:, 0]
-
-# (Optionally) normalize embeddings
-embeddings = F.normalize(embeddings, p=2, dim=1)
-scores = (embeddings[:1] @ embeddings[1:].T) * 100
-print(scores.tolist())
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+model_name_or_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+model = AutoModelForSequenceClassification.from_pretrained(
+    model_name_or_path, trust_remote_code=True,
+    torch_dtype=torch.float16
+)
+model.eval()
+
+pairs = [["what is the capital of China?", "Beijing"], ["how to implement quick sort in python?", "Introduction of quick sort"], ["how to implement quick sort in python?", "The weather is nice today"]]
+
+with torch.no_grad():
+    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
+    scores = model(**inputs, return_dict=True).logits.view(-1).float()
+    print(scores)
+
+# tensor([1.2315, 0.5923, 0.3041])
 ```
 
 Use with `sentence-transformers`:
 
+Before you start, install the sentence-transformers library:
+```
+pip install sentence-transformers
+```
+
 ```python
 # Requires sentence_transformers>=2.7.0
-from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
+from sentence_transformers import CrossEncoder
 
-sentences = ['That is a happy person', 'That is a very happy person']
+model_name_or_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
 
-model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)
-embeddings = model.encode(sentences)
-print(cos_sim(embeddings[0], embeddings[1]))
-```
-
-Use with `transformers.js`:
-
-```js
-// npm i @xenova/transformers
-import { pipeline, dot } from '@xenova/transformers';
-
-// Create feature extraction pipeline
-const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
-  quantized: false, // Comment out this line to use the quantized version
-});
-
-// Generate sentence embeddings
-const sentences = [
-  "what is the capital of China?",
-  "how to implement quick sort in python?",
-  "Beijing",
-  "sorting algorithms"
-]
-const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
-
-// Compute similarity scores
-const [source_embeddings, ...document_embeddings ] = output.tolist();
-const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
-console.log(similarities);
+model = CrossEncoder(
+    model_name_or_path,
+    automodel_args={"torch_dtype": "auto"},
+    trust_remote_code=True,
+)
+
+pairs = [["what is the capital of China?", "Beijing"], ["how to implement quick sort in python?", "Introduction of quick sort"], ["how to implement quick sort in python?", "The weather is nice today"]]
+
+scores = model.predict(pairs, convert_to_tensor=True).tolist()
+
+print("scores:", scores)
 ```
 
 ## Training Details
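
A note on reading the output of the new `transformers` example: the reranker head returns one raw logit per pair, so scores like `tensor([1.2315, 0.5923, 0.3041])` are unbounded and meaningful only relative to each other. If scores in (0, 1) are preferred, a sigmoid can be applied. A minimal sketch; the input values are the logits printed above, and the sigmoid step is our addition, not something the model card prescribes:

```python
import torch

# Raw reranker logits, as printed by the `transformers` example above.
scores = torch.tensor([1.2315, 0.5923, 0.3041])

# Sigmoid maps each logit into (0, 1); the ranking is unchanged
# because the sigmoid is monotonic.
probabilities = torch.sigmoid(scores)
print(probabilities)  # ~ tensor([0.7741, 0.6439, 0.5754])
```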
 
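For the common rerank workflow of scoring one query against many candidate documents, recent `sentence-transformers` releases also expose `CrossEncoder.rank`, which wraps `predict` and returns the candidates sorted by score. A minimal sketch under the same model setup as the `CrossEncoder` example above; if `rank` is unavailable in your installed version, `predict` on explicit pairs gives the same scores:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    'Alibaba-NLP/gte-reranker-modernbert-base',
    automodel_args={"torch_dtype": "auto"},
    trust_remote_code=True,
)

query = "how to implement quick sort in python?"
documents = [
    "Introduction of quick sort",
    "The weather is nice today",
]

# rank() scores every (query, document) pair and returns the candidates
# sorted by decreasing relevance score.
for hit in model.rank(query, documents, return_documents=True):
    print(f"{hit['score']:.4f}\t{hit['text']}")
```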