ai-forever committed
Commit 47d81ba
Parent: e48ff06

Update README.md

Files changed: README.md (+52, −3)
---
license: mit
language:
- ru
- en
tags:
- PyTorch
- Transformers
---

# ru-en-RoBERTa-large model for Sentence Embeddings in Russian and English
The model is described [in this article](<link of our arxiv>).
Russian MTEB [metrics](<link of our ruMTEB>) are also reported.

For better quality, use mean token embeddings (mean pooling).
## Usage (HuggingFace Models Repository)
You can use the model directly from the model repository to compute sentence embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# There are two ways to create sentence embeddings: the CLS token embedding or mean pooling.

# Mean pooling example: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# Sentences we want sentence embeddings for
sentences = ['Привет! Как твои дела?',                # "Hi! How are you?"
             'А правда, что 42 твое любимое число?']  # "Is it true that 42 is your favorite number?"

# Load the tokenizer and model from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Option 1: mean pooling
sentence_mean_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Option 2: CLS "pooling" (embedding of the first token)
last_hidden_states = model_output[0]
sentence_cls_embeddings = last_hidden_states[:, 0]
```