sergeyzh committed (verified)
Commit c70dbfd · 1 Parent(s): 851acc6

Update README.md

Files changed (1)
  1. README.md +101 -139
README.md CHANGED
@@ -1,139 +1,101 @@
- ---
- language:
- - ru
-
- pipeline_tag: sentence-similarity
-
- tags:
- - russian
- - pretraining
- - embeddings
- - tiny
- - feature-extraction
- - sentence-similarity
- - sentence-transformers
- - transformers
-
- datasets:
- - IlyaGusev/gazeta
- - zloelias/lenta-ru
-
- license: mit
- base_model: cointegrated/rubert-tiny2
-
- ---
-
- ## Base BERT for semantic text similarity (STS) on CPU
-
- A base BERT model for computing compact sentence embeddings in Russian. The model is based on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2): it keeps the same context size (2048) and embedding dimension (312), while the number of layers is increased from 3 to 7.
-
- On STS and related Russian-language tasks (PI, NLI, SA, TI) it outperforms [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) in quality. Working with contexts longer than 512 tokens requires fine-tuning for the target domain.
-
- ## Choosing a model from the BERT-STS series (quality/speed)
- | Recommended model | CPU <br> (STS; snt/s) | GPU <br> (STS; snt/s) |
- |:---------------------------------|:---------:|:---------:|
- | Fast model (speed) | [rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) <br> (0.797; 1190) | - |
- | Base model (quality) | **rubert-mini-sts <br> (0.815; 539)** | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) <br> (0.845; 1894) |
-
- ## The best model for use in RAG LLMs with CPU inference:
- - high quality on fuzzy queries (excellent metrics on the STS, PI, and NLI tasks);
- - low impact of a text's emotional tone on its embedding (average scores on the SA and TI tasks);
- - easy extension of the text document base (CPU throughput above 500 sentences per second);
- - faster kNN matching thanks to the low embedding dimensionality of 312 (see the sketch after this list);
- - ease of use (compatible with [SentenceTransformer](https://github.com/UKPLab/sentence-transformers)).
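A minimal sketch of the kNN point above: with L2-normalized 312-dimensional embeddings, a single matrix product scores an entire corpus against a query (the corpus and query below are hypothetical placeholders):

```python
# Sketch: brute-force kNN over normalized embeddings; dot product == cosine.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/rubert-mini-sts')

corpus = ["привет мир", "hello world", "здравствуй вселенная"]  # hypothetical document base
corpus_emb = model.encode(corpus, normalize_embeddings=True)    # shape (n_docs, 312)
query_emb = model.encode(["приветствие"], normalize_embeddings=True)[0]

scores = corpus_emb @ query_emb          # cosine similarities, shape (n_docs,)
for i in np.argsort(-scores)[:2]:        # two nearest documents
    print(f"{scores[i]:.3f}  {corpus[i]}")
```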
-
- ## Using the model with the `transformers` library:
-
- ```python
- # pip install transformers sentencepiece
- import torch
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-sts")
- model = AutoModel.from_pretrained("sergeyzh/rubert-mini-sts")
- # model.cuda()  # uncomment it if you have a GPU
-
- def embed_bert_cls(text, model, tokenizer):
-     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
-     with torch.no_grad():
-         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
-     embeddings = model_output.last_hidden_state[:, 0, :]  # CLS-token embedding
-     embeddings = torch.nn.functional.normalize(embeddings)  # L2-normalize
-     return embeddings[0].cpu().numpy()
-
- print(embed_bert_cls('привет мир', model, tokenizer).shape)
- # (312,)
- ```
-
- ## Usage with `sentence_transformers`:
- ```python
- from sentence_transformers import SentenceTransformer, util
-
- model = SentenceTransformer('sergeyzh/rubert-mini-sts')
-
- sentences = ["привет мир", "hello world", "здравствуй вселенная"]
- embeddings = model.encode(sentences)
- print(util.dot_score(embeddings, embeddings))
- ```
-
- ## Metrics
- Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | STS | PI | NLI | SA | TI |
- |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
- | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
- | **sergeyzh/rubert-mini-sts** | **0.815** | **0.723** | **0.477** | **0.791** | **0.949** |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
-
- **Tasks:**
-
- - Semantic text similarity (**STS**);
- - Paraphrase identification (**PI**);
- - Natural language inference (**NLI**);
- - Sentiment analysis (**SA**);
- - Toxicity identification (**TI**).
-
- ## Speed and size
-
- On the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
- |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
- | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
- | **sergeyzh/rubert-mini-sts** | **6.417** | **5.517** | **123** | **312** | **2048** | **83828** |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
-
-
- When encoding in batches with `sentence_transformers` (run in IPython/Jupyter; `%timeit` is an IPython magic):
-
- ```python
- from sentence_transformers import SentenceTransformer
-
- model_name = 'sergeyzh/rubert-mini-sts'
- model = SentenceTransformer(model_name, device='cpu')
- sentences = ["Тест быстродействия на CPU Ryzen 7 3800X: batch = 500"] * 500
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 927 ms ± 7.88 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 500/0.927 = 539 snt/s
-
- model = SentenceTransformer(model_name, device='cuda')
- sentences = ["Тест быстродействия на GPU RTX 3060: batch = 5000"] * 5000
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 964 ms ± 26.8 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 5000/0.964 = 5187 snt/s
- ```
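Outside IPython, a comparable throughput check can be approximated with `time.perf_counter`; a rough sketch (numbers will vary with hardware):

```python
# Sketch: approximate snt/s without IPython magics.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/rubert-mini-sts', device='cpu')
sentences = ["Тест быстродействия на CPU: batch = 500"] * 500

start = time.perf_counter()
model.encode(sentences)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.0f} snt/s")  # hardware-dependent
```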
-
-
- ## Related resources
- Questions about using the model are discussed in the [Russian-language NLP chat](https://t.me/natural_language_processing).
-
 
+ ---
+ language:
+ - ru
+
+ pipeline_tag: sentence-similarity
+
+ tags:
+ - russian
+ - pretraining
+ - embeddings
+ - tiny
+ - feature-extraction
+ - sentence-similarity
+ - sentence-transformers
+ - transformers
+
+ datasets:
+ - IlyaGusev/gazeta
+ - zloelias/lenta-ru
+
+ license: mit
+ base_model: cointegrated/rubert-tiny2
+
+ ---
+
+ ## Base BERT for semantic text similarity (STS) on CPU
+
+ A base BERT model for computing compact sentence embeddings in Russian. The model is based on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2): it keeps the same context size (2048) and embedding dimension (312), while the number of layers is increased from 3 to 7.
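The sizes quoted above map onto standard `transformers` config attributes, so they can be checked directly; a small sketch (the expected values are taken from the card):

```python
# Sketch: read the advertised sizes from the model config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sergeyzh/rubert-mini-sts")
print(cfg.num_hidden_layers)        # layers: 7 per the card
print(cfg.hidden_size)              # embedding dimension: 312 per the card
print(cfg.max_position_embeddings)  # context size: 2048 per the card
```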
+
+
+ ## Using the model with the `transformers` library:
+
+ ```python
+ # pip install transformers sentencepiece
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-sts")
+ model = AutoModel.from_pretrained("sergeyzh/rubert-mini-sts")
+ # model.cuda()  # uncomment it if you have a GPU
+
+ def embed_bert_cls(text, model, tokenizer):
+     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+     with torch.no_grad():
+         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+     embeddings = model_output.last_hidden_state[:, 0, :]  # CLS-token embedding
+     embeddings = torch.nn.functional.normalize(embeddings)  # L2-normalize
+     return embeddings[0].cpu().numpy()
+
+ print(embed_bert_cls('привет мир', model, tokenizer).shape)
+ # (312,)
+ ```
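Because the embeddings are L2-normalized, dot products are cosine similarities. A short sketch extending the CLS-pooling function above to a batch (reuses `model` and `tokenizer` from the previous block):

```python
# Sketch: batch CLS pooling, then pairwise cosine similarities in one matmul.
def embed_batch(texts, model, tokenizer):
    t = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = model(**{k: v.to(model.device) for k, v in t.items()})
    return torch.nn.functional.normalize(out.last_hidden_state[:, 0, :]).cpu()

emb = embed_batch(["привет мир", "hello world", "здравствуй вселенная"], model, tokenizer)
print(emb @ emb.T)  # pairwise cosine similarities, shape (3, 3)
```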
+
+ ## Usage with `sentence_transformers`:
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('sergeyzh/rubert-mini-sts')
+
+ sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+ embeddings = model.encode(sentences)
+ print(util.dot_score(embeddings, embeddings))
+ ```
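For retrieval-style use (e.g. inside a RAG pipeline), the same model works with `util.semantic_search`; a minimal sketch with a hypothetical corpus:

```python
# Sketch: top-k retrieval over a small in-memory corpus.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/rubert-mini-sts')

corpus = ["привет мир", "hello world", "здравствуй вселенная"]  # hypothetical
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("приветствие", convert_to_tensor=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```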
+
+ ## Metrics
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | STS | PI | NLI | SA | TI |
+ |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
+ | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
+ | **sergeyzh/rubert-mini-sts** | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
+
+ **Tasks:**
+
+ - Semantic text similarity (**STS**);
+ - Paraphrase identification (**PI**);
+ - Natural language inference (**NLI**);
+ - Sentiment analysis (**SA**);
+ - Toxicity identification (**TI**).
+
+ ## Speed and size
+
+ On the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
+ |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
+ | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
+ | **sergeyzh/rubert-mini-sts** | **6.417** | **5.517** | **123** | **312** | **2048** | **83828** |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |