RomainDarous committed
Commit 880d5ef · verified · 1 Parent(s): e8088e3

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
2_Dense/config.json ADDED
@@ -0,0 +1 @@
1
+ {"in_features": 768, "out_features": 512, "bias": true, "activation_function": "torch.nn.modules.activation.Tanh"}
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8e186727ab1944c63cd981dec7b02acdf05179e3041a112df3a1dc5d5f790cb
3
+ size 1575072
README.md ADDED
@@ -0,0 +1,913 @@
1
+ ---
2
+ language:
3
+ - bn
4
+ - cs
5
+ - de
6
+ - en
7
+ - et
8
+ - fi
9
+ - fr
10
+ - gu
11
+ - ha
12
+ - hi
13
+ - is
14
+ - ja
15
+ - kk
16
+ - km
17
+ - lt
18
+ - lv
19
+ - pl
20
+ - ps
21
+ - ru
22
+ - ta
23
+ - tr
24
+ - uk
25
+ - xh
26
+ - zh
27
+ - zu
28
+ - ne
29
+ - ro
30
+ - si
31
+ tags:
32
+ - sentence-transformers
33
+ - sentence-similarity
34
+ - feature-extraction
35
+ - generated_from_trainer
36
+ - dataset_size:1327190
37
+ - loss:CoSENTLoss
38
+ base_model: sentence-transformers/distiluse-base-multilingual-cased-v2
39
+ widget:
40
+ - source_sentence: यहाँका केही धार्मिक सम्पदाहरू यस प्रकार रहेका छन्।
41
+ sentences:
42
+ - A party works journalists from advertisements about a massive Himalayan post.
43
+ - Some religious affiliations here remain.
44
+ - In Spain, the strict opposition of Roman Catholic churches is found to have assumed
45
+ a marriage similar to male beach wives.
46
+ - source_sentence: '"We can use this discovery to target both the assembly and stability
47
+ of the capsid, to either prevent the formation of the virus when it infects the
48
+ host cell, or break it apart after it''s formed," Luque said. "This could facilitate
49
+ the characterization and identification of antiviral targets for viruses sharing
50
+ the same icosahedral layout."'
51
+ sentences:
52
+ - FC inter have today released Shefki Kuqi from the club's representative team coach
53
+ duties.
54
+ - '"Wir können diese Entdeckung nutzen, um sowohl die Montage als auch die Stabilität
55
+ des Kapsids anzustreben, um entweder die Bildung des Virus zu verhindern, wenn
56
+ es die Wirtszelle infiziert oder nach seiner Bildung auseinanderbricht", sagte
57
+ Luque. "Dies könnte die Charakterisierung und Identifizierung von antiviralen
58
+ Zielen für Viren erleichtern, die das gleiche ikosaedrische Layout teilen".'
59
+ - Quellen sagen, Jones sei „wütend“, als das goldene Mädchen des Fernsehens bei
60
+ einem angespannten Treffen am Dienstag im Hauptquartier seines Geschäftsimperiums
61
+ in Marlow, Buckinghamshire, zugab, dass ihre neuen Deals - im Wert von bis zu
62
+ 1,5 Millionen Pfund - bedeuteten, dass sie nicht mehr genug Zeit hatte, sich ihrer
63
+ Hausbekleidungs- und Zubehörmarke Truly zu widmen.
64
+ - source_sentence: He possesses a pistol with silver bullets for protection from vampires
65
+ and werewolves.
66
+ sentences:
67
+ - Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen.
68
+ - Bibimbap umfasst Reis, Spinat, Rettich, Bohnensprossen.
69
+ - BSAC profitierte auch von den großen, aber nicht unbeschränkten persönlichen Vermögen
70
+ von Rhodos und Beit vor ihrem Tod.
71
+ - source_sentence: To the west of the Badger Head Inlier is the Port Sorell Formation,
72
+ a tectonic mélange of marine sediments and dolerite.
73
+ sentences:
74
+ - Er brennt einen Speer und brennt Flammen aus seinem Mund, wenn er wütend ist.
75
+ - Westlich des Badger Head Inlier befindet sich die Port Sorell Formation, eine
76
+ tektonische Mischung aus Sedimenten und Dolerit.
77
+ - Public Lynching and Mob Violence under Modi Government
78
+ - source_sentence: Garnizoana otomană se retrage în sudul Dunării, iar după 164 de
79
+ ani cetatea intră din nou sub stăpânirea europenilor.
80
+ sentences:
81
+ - This is because, once again, we have taken into account the fact that we have
82
+ adopted a large number of legislative proposals.
83
+ - Helsinki University ranks 75th among universities for 2010.
84
+ - Ottoman garnisoana is withdrawing into the south of the Danube and, after 164
85
+ years, it is once again under the control of Europeans.
86
+ datasets:
87
+ - RicardoRei/wmt-da-human-evaluation
88
+ - wmt/wmt20_mlqe_task1
89
+ pipeline_tag: sentence-similarity
90
+ library_name: sentence-transformers
91
+ metrics:
92
+ - pearson_cosine
93
+ - spearman_cosine
94
+ model-index:
95
+ - name: SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
96
+ results:
97
+ - task:
98
+ type: semantic-similarity
99
+ name: Semantic Similarity
100
+ dataset:
101
+ name: sts eval
102
+ type: sts-eval
103
+ metrics:
104
+ - type: pearson_cosine
105
+ value: 0.42415369784945883
106
+ name: Pearson Cosine
107
+ - type: spearman_cosine
108
+ value: 0.4175469519194782
109
+ name: Spearman Cosine
110
+ - type: pearson_cosine
111
+ value: 0.0772713008408403
112
+ name: Pearson Cosine
113
+ - type: spearman_cosine
114
+ value: 0.13050905562438264
115
+ name: Spearman Cosine
116
+ - type: pearson_cosine
117
+ value: 0.16731845692612535
118
+ name: Pearson Cosine
119
+ - type: spearman_cosine
120
+ value: 0.18366199919315862
121
+ name: Spearman Cosine
122
+ - type: pearson_cosine
123
+ value: 0.3567214608388243
124
+ name: Pearson Cosine
125
+ - type: spearman_cosine
126
+ value: 0.3656734148567112
127
+ name: Spearman Cosine
128
+ - type: pearson_cosine
129
+ value: 0.41267092498678554
130
+ name: Pearson Cosine
131
+ - type: spearman_cosine
132
+ value: 0.41036446071667193
133
+ name: Spearman Cosine
134
+ - type: pearson_cosine
135
+ value: 0.5254563854630899
136
+ name: Pearson Cosine
137
+ - type: spearman_cosine
138
+ value: 0.4785530551765603
139
+ name: Spearman Cosine
140
+ - type: pearson_cosine
141
+ value: 0.31194241573567016
142
+ name: Pearson Cosine
143
+ - type: spearman_cosine
144
+ value: 0.2814160300891252
145
+ name: Spearman Cosine
146
+ - task:
147
+ type: semantic-similarity
148
+ name: Semantic Similarity
149
+ dataset:
150
+ name: sts test
151
+ type: sts-test
152
+ metrics:
153
+ - type: pearson_cosine
154
+ value: 0.4253603788235729
155
+ name: Pearson Cosine
156
+ - type: spearman_cosine
157
+ value: 0.4166117661445095
158
+ name: Spearman Cosine
159
+ - type: pearson_cosine
160
+ value: 0.022187134575214738
161
+ name: Pearson Cosine
162
+ - type: spearman_cosine
163
+ value: 0.04647559130832398
164
+ name: Spearman Cosine
165
+ - type: pearson_cosine
166
+ value: 0.15979577569463932
167
+ name: Pearson Cosine
168
+ - type: spearman_cosine
169
+ value: 0.2074497419832692
170
+ name: Spearman Cosine
171
+ - type: pearson_cosine
172
+ value: 0.3698928748443983
173
+ name: Pearson Cosine
174
+ - type: spearman_cosine
175
+ value: 0.3757690724227716
176
+ name: Spearman Cosine
177
+ - type: pearson_cosine
178
+ value: 0.44937864470538347
179
+ name: Pearson Cosine
180
+ - type: spearman_cosine
181
+ value: 0.45866193737582717
182
+ name: Spearman Cosine
183
+ - type: pearson_cosine
184
+ value: 0.4466389646053608
185
+ name: Pearson Cosine
186
+ - type: spearman_cosine
187
+ value: 0.4158920394678395
188
+ name: Spearman Cosine
189
+ - type: pearson_cosine
190
+ value: 0.33243289478987115
191
+ name: Pearson Cosine
192
+ - type: spearman_cosine
193
+ value: 0.2806845193699054
194
+ name: Spearman Cosine
195
+ ---
196
+
197
+ # SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
198
+
199
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) on the [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation), [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) and [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) datasets. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
200
+
201
+ ## Model Details
202
+
203
+ ### Model Description
204
+ - **Model Type:** Sentence Transformer
205
+ - **Base model:** [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) <!-- at revision dad0fa1ee4fa6e982d3adbce87c73c02e6aee838 -->
206
+ - **Maximum Sequence Length:** 128 tokens
207
+ - **Output Dimensionality:** 512 dimensions
208
+ - **Similarity Function:** Cosine Similarity
209
+ - **Training Datasets:**
210
+ - [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation)
211
+ - [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
212
+ - [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
213
+ - [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
214
+ - [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
215
+ - [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
216
+ - [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
217
+ - **Languages:** bn, cs, de, en, et, fi, fr, gu, ha, hi, is, ja, kk, km, lt, lv, pl, ps, ru, ta, tr, uk, xh, zh, zu, ne, ro, si
218
+ <!-- - **License:** Unknown -->
219
+
220
+ ### Model Sources
221
+
222
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
223
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
224
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
225
+
226
+ ### Full Model Architecture
227
+
228
+ ```
229
+ SentenceTransformer(
230
+ (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
231
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
232
+ (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
233
+ )
234
+ ```
235
+
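+ The three modules are applied in sequence: the DistilBERT transformer produces 768-dimensional token embeddings, mean pooling averages them over the attention mask into a single 768-dimensional sentence vector, and the Dense layer projects that vector to 512 dimensions through a Tanh activation. The sketch below is only an illustration of that computation (in practice `model.encode` runs the modules for you); it assumes the model is loaded as in the Usage section:
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("RomainDarous/distiluse-base-multilingual-cased-v2-sts")
+ features = model.tokenize(["A small example sentence."])
+
+ with torch.no_grad():
+     token_embeddings = model[0].auto_model(**features).last_hidden_state  # (1, seq_len, 768)
+     mask = features["attention_mask"].unsqueeze(-1).float()               # (1, seq_len, 1)
+     mean_pooled = (token_embeddings * mask).sum(1) / mask.sum(1)          # (1, 768) mean pooling
+     sentence_embedding = torch.tanh(model[2].linear(mean_pooled))         # (1, 512) Dense + Tanh
+
+ print(sentence_embedding.shape)  # torch.Size([1, 512])
+ ```
+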
236
+ ## Usage
237
+
238
+ ### Direct Usage (Sentence Transformers)
239
+
240
+ First install the Sentence Transformers library:
241
+
242
+ ```bash
243
+ pip install -U sentence-transformers
244
+ ```
245
+
246
+ Then you can load this model and run inference.
247
+ ```python
248
+ from sentence_transformers import SentenceTransformer
249
+
250
+ # Download from the 🤗 Hub
251
+ model = SentenceTransformer("RomainDarous/distiluse-base-multilingual-cased-v2-sts")
252
+ # Run inference
253
+ sentences = [
254
+ 'Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.',
255
+ 'Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.',
256
+ 'This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.',
257
+ ]
258
+ embeddings = model.encode(sentences)
259
+ print(embeddings.shape)
260
+ # [3, 512]
261
+
262
+ # Get the similarity scores for the embeddings
263
+ similarities = model.similarity(embeddings, embeddings)
264
+ print(similarities.shape)
265
+ # [3, 3]
266
+ ```
267
+
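+ Because the training pairs are source sentences paired with machine translations scored by human annotators, the cosine similarity between a source sentence and a candidate translation can also serve as a rough, reference-free translation-quality signal. A small sketch reusing one of the example pairs from this card:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("RomainDarous/distiluse-base-multilingual-cased-v2-sts")
+
+ source = "He possesses a pistol with silver bullets for protection from vampires and werewolves."
+ translation = "Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen."
+
+ embeddings = model.encode([source, translation])
+ scores = model.similarity(embeddings, embeddings)
+ print(float(scores[0, 1]))  # cosine similarity; higher values suggest a closer translation
+ ```
+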
268
+ <!--
269
+ ### Direct Usage (Transformers)
270
+
271
+ <details><summary>Click to see the direct usage in Transformers</summary>
272
+
273
+ </details>
274
+ -->
275
+
276
+ <!--
277
+ ### Downstream Usage (Sentence Transformers)
278
+
279
+ You can finetune this model on your own dataset.
280
+
281
+ <details><summary>Click to expand</summary>
282
+
283
+ </details>
284
+ -->
285
+
286
+ <!--
287
+ ### Out-of-Scope Use
288
+
289
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
290
+ -->
291
+
292
+ ## Evaluation
293
+
294
+ ### Metrics
295
+
296
+ #### Semantic Similarity
297
+
298
+ * Datasets: `sts-eval` and `sts-test`
299
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
300
+
301
+ | Metric | sts-eval | sts-test |
302
+ |:--------------------|:-----------|:-----------|
303
+ | pearson_cosine | 0.4242 | 0.3324 |
304
+ | **spearman_cosine** | **0.4175** | **0.2807** |
305
+
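+ These correlations come from `EmbeddingSimilarityEvaluator`, which encodes each sentence pair, computes their cosine similarity, and reports the Pearson and Spearman correlation against the human scores. A minimal sketch for running the same kind of evaluation on your own labelled pairs (the pairs below are placeholders, not the actual evaluation data):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+
+ model = SentenceTransformer("RomainDarous/distiluse-base-multilingual-cased-v2-sts")
+
+ # Placeholder pairs; substitute a real evaluation split with sentence1, sentence2 and a score in [0, 1].
+ evaluator = EmbeddingSimilarityEvaluator(
+     sentences1=["The cat sits outside", "A man is playing guitar", "The new movie is awesome"],
+     sentences2=["Die Katze sitzt draußen", "Ein Mann spielt Gitarre", "Der neue Film ist großartig"],
+     scores=[0.95, 0.90, 0.85],
+     name="sts-eval",
+ )
+ results = evaluator(model)
+ print(results)  # e.g. {'sts-eval_pearson_cosine': ..., 'sts-eval_spearman_cosine': ...}
+ ```
+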
306
+ #### Semantic Similarity
307
+
308
+ * Dataset: `sts-eval`
309
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
310
+
311
+ | Metric | Value |
312
+ |:--------------------|:-----------|
313
+ | pearson_cosine | 0.0773 |
314
+ | **spearman_cosine** | **0.1305** |
315
+
316
+ #### Semantic Similarity
317
+
318
+ * Dataset: `sts-eval`
319
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
320
+
321
+ | Metric | Value |
322
+ |:--------------------|:-----------|
323
+ | pearson_cosine | 0.1673 |
324
+ | **spearman_cosine** | **0.1837** |
325
+
326
+ #### Semantic Similarity
327
+
328
+ * Dataset: `sts-eval`
329
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
330
+
331
+ | Metric | Value |
332
+ |:--------------------|:-----------|
333
+ | pearson_cosine | 0.3567 |
334
+ | **spearman_cosine** | **0.3657** |
335
+
336
+ #### Semantic Similarity
337
+
338
+ * Dataset: `sts-eval`
339
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
340
+
341
+ | Metric | Value |
342
+ |:--------------------|:-----------|
343
+ | pearson_cosine | 0.4127 |
344
+ | **spearman_cosine** | **0.4104** |
345
+
346
+ #### Semantic Similarity
347
+
348
+ * Dataset: `sts-eval`
349
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
350
+
351
+ | Metric | Value |
352
+ |:--------------------|:-----------|
353
+ | pearson_cosine | 0.5255 |
354
+ | **spearman_cosine** | **0.4786** |
355
+
356
+ #### Semantic Similarity
357
+
358
+ * Dataset: `sts-eval`
359
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
360
+
361
+ | Metric | Value |
362
+ |:--------------------|:-----------|
363
+ | pearson_cosine | 0.3119 |
364
+ | **spearman_cosine** | **0.2814** |
365
+
366
+ <!--
367
+ ## Bias, Risks and Limitations
368
+
369
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
370
+ -->
371
+
372
+ <!--
373
+ ### Recommendations
374
+
375
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
376
+ -->
377
+
378
+ ## Training Details
379
+
380
+ ### Training Datasets
381
+
382
+ #### wmt_da
383
+
384
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
385
+ * Size: 1,285,190 training samples
386
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
387
+ * Approximate statistics based on the first 1000 samples:
388
+ | | sentence1 | sentence2 | score |
389
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
390
+ | type | string | string | float |
391
+ | details | <ul><li>min: 4 tokens</li><li>mean: 37.09 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 37.12 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.7</li><li>max: 1.0</li></ul> |
392
+ * Samples:
393
+ | sentence1 | sentence2 | score |
394
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|
395
+ | <code>Z dat ÚZIS také vyplývá, že se zastavil úbytek zdravotních sester v nemocnicích.</code> | <code>The data from the IHIS also shows that the decline of nurses in hospitals has stopped.</code> | <code>0.47</code> |
396
+ | <code>Я был самым гордым, самым пьяным девственником, которого кто-либо когда-либо видел.</code> | <code>I was the proudest, most drunk virgin anyone had ever seen.</code> | <code>0.99</code> |
397
+ | <code>Das Trampolinspringen hat einen gewissen Außenseitercharme, teilweise weil es für das unaufgeklärte Ohr passender für eine Clownsschule als die für die Olympischen Spiele klingt.</code> | <code>The trampoline jumping has some outsider charm, in part because it sounds more appropriate for the unenlightened ear for a clowns school than the one for the Olympics.</code> | <code>0.81</code> |
398
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
399
+ ```json
400
+ {
401
+ "scale": 20.0,
402
+ "similarity_fct": "pairwise_cos_sim"
403
+ }
404
+ ```
405
+
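+ For reference, CoSENTLoss with the parameters above (pairwise cosine similarity and scale λ = 20) optimizes, over every pair of training pairs whose gold scores are ordered, roughly the objective below (see the CoSENT reference cited at the end of this card):
+
+ $$\mathcal{L} = \log\Bigl(1 + \sum_{s(i,j) > s(k,l)} e^{\lambda\,\bigl(\cos(u_k, u_l) - \cos(u_i, u_j)\bigr)}\Bigr)$$
+
+ where $u$ denotes the sentence embeddings and $s(i,j)$ the human score of the pair $(i,j)$, so pairs with higher human scores are pushed towards higher cosine similarity than pairs with lower scores. The same loss configuration is reused for every corpus below.
+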
406
+ #### mlqe_en_de
407
+
408
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
409
+ * Size: 7,000 training samples
410
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
411
+ * Approximate statistics based on the first 1000 samples:
412
+ | | sentence1 | sentence2 | score |
413
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
414
+ | type | string | string | float |
415
+ | details | <ul><li>min: 11 tokens</li><li>mean: 23.78 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.51 tokens</li><li>max: 54 tokens</li></ul> | <ul><li>min: 0.06</li><li>mean: 0.86</li><li>max: 1.0</li></ul> |
416
+ * Samples:
417
+ | sentence1 | sentence2 | score |
418
+ |:-------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
419
+ | <code>Early Muslim traders and merchants visited Bengal while traversing the Silk Road in the first millennium.</code> | <code>Frühe muslimische Händler und Kaufleute besuchten Bengalen, während sie im ersten Jahrtausend die Seidenstraße durchquerten.</code> | <code>0.9233333468437195</code> |
420
+ | <code>While Fran dissipated shortly after that, the tropical wave progressed into the northeastern Pacific Ocean.</code> | <code>Während Fran kurz danach zerstreute, entwickelte sich die tropische Welle in den nordöstlichen Pazifischen Ozean.</code> | <code>0.8899999856948853</code> |
421
+ | <code>Distressed securities include such events as restructurings, recapitalizations, and bankruptcies.</code> | <code>Zu den belasteten Wertpapieren gehören Restrukturierungen, Rekapitalisierungen und Insolvenzen.</code> | <code>0.9300000071525574</code> |
422
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
423
+ ```json
424
+ {
425
+ "scale": 20.0,
426
+ "similarity_fct": "pairwise_cos_sim"
427
+ }
428
+ ```
429
+
430
+ #### mlqe_en_zh
431
+
432
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
433
+ * Size: 7,000 training samples
434
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
435
+ * Approximate statistics based on the first 1000 samples:
436
+ | | sentence1 | sentence2 | score |
437
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
438
+ | type | string | string | float |
439
+ | details | <ul><li>min: 9 tokens</li><li>mean: 24.09 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 29.93 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 0.98</li></ul> |
440
+ * Samples:
441
+ | sentence1 | sentence2 | score |
442
+ |:-------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------|:---------------------------------|
443
+ | <code>In the late 1980s, the hotel's reputation declined, and it functioned partly as a "backpackers hangout."</code> | <code>在 20 世纪 80 年代末 , 这家旅馆的声誉下降了 , 部分地起到了 "背包吊销" 的作用。</code> | <code>0.40666666626930237</code> |
444
+ | <code>From 1870 to 1915, 36 million Europeans migrated away from Europe.</code> | <code>从 1870 年到 1915 年 , 3, 600 万欧洲人从欧洲移民。</code> | <code>0.8333333730697632</code> |
445
+ | <code>In some photos, the footpads did press into the regolith, especially when they moved sideways at touchdown.</code> | <code>在一些照片中 , 脚垫确实挤进了后台 , 尤其是当他们在触地时侧面移动时。</code> | <code>0.33000001311302185</code> |
446
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
447
+ ```json
448
+ {
449
+ "scale": 20.0,
450
+ "similarity_fct": "pairwise_cos_sim"
451
+ }
452
+ ```
453
+
454
+ #### mlqe_et_en
455
+
456
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
457
+ * Size: 7,000 training samples
458
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
459
+ * Approximate statistics based on the first 1000 samples:
460
+ | | sentence1 | sentence2 | score |
461
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
462
+ | type | string | string | float |
463
+ | details | <ul><li>min: 14 tokens</li><li>mean: 31.88 tokens</li><li>max: 63 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 24.57 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.67</li><li>max: 1.0</li></ul> |
464
+ * Samples:
465
+ | sentence1 | sentence2 | score |
466
+ |:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
467
+ | <code>Gruusias vahistati president Mihhail Saakašvili pressibüroo nõunik Simon Kiladze, keda süüdistati spioneerimises.</code> | <code>In Georgia, an adviser to the press office of President Mikhail Saakashvili, Simon Kiladze, was arrested and accused of spying.</code> | <code>0.9466666579246521</code> |
468
+ | <code>Nii teadmissotsioloogia pooldajad tavaliselt Kuhni tõlgendavadki, arendades tema vaated sõnaselgeks relativismiks.</code> | <code>This is how supporters of knowledge sociology usually interpret Kuhn by developing his views into an explicit relativism.</code> | <code>0.9366666674613953</code> |
469
+ | <code>18. jaanuaril 2003 haarasid mitmeid Canberra eeslinnu võsapõlengud, milles hukkus neli ja sai vigastada 435 inimest.</code> | <code>On 18 January 2003, several of the suburbs of Canberra were seized by debt fires which killed four people and injured 435 people.</code> | <code>0.8666666150093079</code> |
470
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
471
+ ```json
472
+ {
473
+ "scale": 20.0,
474
+ "similarity_fct": "pairwise_cos_sim"
475
+ }
476
+ ```
477
+
478
+ #### mlqe_ne_en
479
+
480
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
481
+ * Size: 7,000 training samples
482
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
483
+ * Approximate statistics based on the first 1000 samples:
484
+ | | sentence1 | sentence2 | score |
485
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
486
+ | type | string | string | float |
487
+ | details | <ul><li>min: 17 tokens</li><li>mean: 40.67 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 24.66 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.39</li><li>max: 1.0</li></ul> |
488
+ * Samples:
489
+ | sentence1 | sentence2 | score |
490
+ |:------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------|:---------------------------------|
491
+ | <code>सामान्‍य बजट प्रायः फेब्रुअरीका अंतिम कार्य दिवसमा लाईन्छ।</code> | <code>A normal budget is usually awarded to the digital working day of February.</code> | <code>0.5600000023841858</code> |
492
+ | <code>कविताका यस्ता स्वरूपमा दुई, तिन वा चार पाउसम्मका मुक्तक, हाइकु, सायरी र लोकसूक्तिहरू पर्दछन् ।</code> | <code>The book consists of two, free of her or four paulets, haiku, Sairi, and locus in such forms.</code> | <code>0.23666666448116302</code> |
493
+ | <code>ब्रिट्नीले यस बारेमा प्रतिक्रिया ब्यक्ता गरदै भनिन,"कुन ठूलो कुरा हो र?</code> | <code>Britney did not respond to this, saying "which is a big thing and a big thing?</code> | <code>0.21666665375232697</code> |
494
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
495
+ ```json
496
+ {
497
+ "scale": 20.0,
498
+ "similarity_fct": "pairwise_cos_sim"
499
+ }
500
+ ```
501
+
502
+ #### mlqe_ro_en
503
+
504
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
505
+ * Size: 7,000 training samples
506
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
507
+ * Approximate statistics based on the first 1000 samples:
508
+ | | sentence1 | sentence2 | score |
509
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
510
+ | type | string | string | float |
511
+ | details | <ul><li>min: 12 tokens</li><li>mean: 29.44 tokens</li><li>max: 60 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 22.38 tokens</li><li>max: 65 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
512
+ * Samples:
513
+ | sentence1 | sentence2 | score |
514
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------|
515
+ | <code>Orașul va fi împărțit în patru districte, iar suburbiile în 10 mahalale.</code> | <code>The city will be divided into four districts and suburbs into 10 mahalals.</code> | <code>0.4699999988079071</code> |
516
+ | <code>La scurt timp după aceasta, au devenit cunoscute debarcările germane de la Trondheim, Bergen și Stavanger, precum și luptele din Oslofjord.</code> | <code>In the light of the above, the Authority concludes that the aid granted to ADIF is compatible with the internal market pursuant to Article 61 (3) (c) of the EEA Agreement.</code> | <code>0.02666666731238365</code> |
517
+ | <code>Până în vara 1791, în Clubul iacobinilor au dominat reprezentanții monarhismului liberal constituțional.</code> | <code>Until the summer of 1791, representatives of liberal constitutional monarchism dominated in the Jacobins Club.</code> | <code>0.8733333349227905</code> |
518
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
519
+ ```json
520
+ {
521
+ "scale": 20.0,
522
+ "similarity_fct": "pairwise_cos_sim"
523
+ }
524
+ ```
525
+
526
+ #### mlqe_si_en
527
+
528
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
529
+ * Size: 7,000 training samples
530
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
531
+ * Approximate statistics based on the first 1000 samples:
532
+ | | sentence1 | sentence2 | score |
533
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
534
+ | type | string | string | float |
535
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.19 tokens</li><li>max: 38 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 22.31 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.51</li><li>max: 1.0</li></ul> |
536
+ * Samples:
537
+ | sentence1 | sentence2 | score |
538
+ |:----------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------|
539
+ | <code>ඇපලෝ 4 සැටර්න් V බූස්ටරයේ ප්‍රථම පර්යේෂණ පියාසැරිය විය.</code> | <code>The first research flight of the Apollo 4 Saturn V Booster.</code> | <code>0.7966666221618652</code> |
540
+ | <code>මෙහි අවපාතය සැලකීමේ දී, මෙහි 48%ක අවරෝහණය $ මිලියන 125කට අධික චිත්‍රපටයක් ලද තෙවන කුඩාම අවපාතය වේ.</code> | <code>In conjunction with the depression here, 48 % of obesity here is the third smallest depression in over $ 125 million film.</code> | <code>0.17666666209697723</code> |
541
+ | <code>එසේම "බකමූණන් මගින් මෙම රාක්ෂසියගේ රාත්‍රී හැසිරීම සංකේතවත් වන බව" පවසයි.</code> | <code>Also "the owl says that this monster's night behavior is symbolic".</code> | <code>0.8799999952316284</code> |
542
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
543
+ ```json
544
+ {
545
+ "scale": 20.0,
546
+ "similarity_fct": "pairwise_cos_sim"
547
+ }
548
+ ```
549
+
550
+ ### Evaluation Datasets
551
+
552
+ #### wmt_da
553
+
554
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
555
+ * Size: 1,285,190 evaluation samples
556
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
557
+ * Approximate statistics based on the first 1000 samples:
558
+ | | sentence1 | sentence2 | score |
559
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
560
+ | type | string | string | float |
561
+ | details | <ul><li>min: 4 tokens</li><li>mean: 36.52 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 36.59 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.7</li><li>max: 1.0</li></ul> |
562
+ * Samples:
563
+ | sentence1 | sentence2 | score |
564
+ |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|
565
+ | <code>The note adds that should the departure from the White House be delayed, a second aircrew would be needed for the return flight due to duty-hour restrictions.</code> | <code>V poznámce se dodává, že pokud by se odlet z Bílého domu zpozdil, byla by pro zpáteční let kvůli omezení pracovní doby nutná druhá letecká posádka.</code> | <code>0.95</code> |
566
+ | <code>上半年电信网络诈骗犯罪上升七成 最高检总结特点-中新网</code> | <code>In the first half of the year, telecommunication network fraud crimes rose by 70%. The highest inspection summary characteristics-Zhongxin.com</code> | <code>0.72</code> |
567
+ | <code>Als zentrale Herausforderungen für den Bundesnachrichtendienst (BND) nannte Merkel den Kampf gegen die Verbreitung von Falschmeldungen im Internet und die Abwehr von Cyberattacken.</code> | <code>Merkel a cité la lutte contre la propagation de fausses nouvelles en ligne et la défense contre les cyberattaques comme des défis majeurs pour le service fédéral de renseignement (BND).</code> | <code>0.87</code> |
568
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
569
+ ```json
570
+ {
571
+ "scale": 20.0,
572
+ "similarity_fct": "pairwise_cos_sim"
573
+ }
574
+ ```
575
+
576
+ #### mlqe_en_de
577
+
578
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
579
+ * Size: 1,000 evaluation samples
580
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
581
+ * Approximate statistics based on the first 1000 samples:
582
+ | | sentence1 | sentence2 | score |
583
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
584
+ | type | string | string | float |
585
+ | details | <ul><li>min: 11 tokens</li><li>mean: 24.11 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.66 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.81</li><li>max: 1.0</li></ul> |
586
+ * Samples:
587
+ | sentence1 | sentence2 | score |
588
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
589
+ | <code>Resuming her patrols, Constitution managed to recapture the American sloop Neutrality on 27 March and, a few days later, the French ship Carteret.</code> | <code>Mit der Wiederaufnahme ihrer Patrouillen gelang es der Verfassung, am 27. März die amerikanische Schleuderneutralität und wenige Tage später das französische Schiff Carteret zurückzuerobern.</code> | <code>0.9033333659172058</code> |
590
+ | <code>Blaine's nomination alienated many Republicans who viewed Blaine as ambitious and immoral.</code> | <code>Blaines Nominierung entfremdete viele Republikaner, die Blaine als ehrgeizig und unmoralisch betrachteten.</code> | <code>0.9216666221618652</code> |
591
+ | <code>This initiated a brief correspondence between the two which quickly descended into political rancor.</code> | <code>Dies leitete eine kurze Korrespondenz zwischen den beiden ein, die schnell zu politischem Groll abstieg.</code> | <code>0.878333330154419</code> |
592
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
593
+ ```json
594
+ {
595
+ "scale": 20.0,
596
+ "similarity_fct": "pairwise_cos_sim"
597
+ }
598
+ ```
599
+
600
+ #### mlqe_en_zh
601
+
602
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
603
+ * Size: 1,000 evaluation samples
604
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
605
+ * Approximate statistics based on the first 1000 samples:
606
+ | | sentence1 | sentence2 | score |
607
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
608
+ | type | string | string | float |
609
+ | details | <ul><li>min: 9 tokens</li><li>mean: 23.75 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 29.56 tokens</li><li>max: 67 tokens</li></ul> | <ul><li>min: 0.26</li><li>mean: 0.65</li><li>max: 0.9</li></ul> |
610
+ * Samples:
611
+ | sentence1 | sentence2 | score |
612
+ |:---------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------|:--------------------------------|
613
+ | <code>Freeman briefly stayed with the king before returning to Accra via Whydah, Ahgwey and Little Popo.</code> | <code>弗里曼在经过惠达、阿格威和小波波回到阿克拉之前与国王一起住了一会儿。</code> | <code>0.6683333516120911</code> |
614
+ | <code>Fantastic Fiction "Scratches in the Sky, Ben Peek, Agog!</code> | <code>奇特的虚构 "天空中的碎片 , 本佩克 , 阿戈 !</code> | <code>0.71833336353302</code> |
615
+ | <code>For Hermann Keller, the running quavers and semiquavers "suffuse the setting with health and strength."</code> | <code>对赫尔曼 · 凯勒来说 , 跑步的跳跃者和半跳跃者 "让环境充满健康和力量" 。</code> | <code>0.7066666483879089</code> |
616
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
617
+ ```json
618
+ {
619
+ "scale": 20.0,
620
+ "similarity_fct": "pairwise_cos_sim"
621
+ }
622
+ ```
623
+
624
+ #### mlqe_et_en
625
+
626
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
627
+ * Size: 1,000 evaluation samples
628
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
629
+ * Approximate statistics based on the first 1000 samples:
630
+ | | sentence1 | sentence2 | score |
631
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
632
+ | type | string | string | float |
633
+ | details | <ul><li>min: 12 tokens</li><li>mean: 32.4 tokens</li><li>max: 58 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.87 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.6</li><li>max: 0.99</li></ul> |
634
+ * Samples:
635
+ | sentence1 | sentence2 | score |
636
+ |:----------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|:---------------------------------|
637
+ | <code>Jackson pidas seal kõne, öeldes, et James Brown on tema suurim inspiratsioon.</code> | <code>Jackson gave a speech there saying that James Brown is his greatest inspiration.</code> | <code>0.9833333492279053</code> |
638
+ | <code>Kaanelugu rääkis loo kolme ungarlase üleelamistest Ungari revolutsiooni päevil.</code> | <code>The life of the Man spoke of a story of three Hungarians living in the days of the Hungarian Revolution.</code> | <code>0.28999999165534973</code> |
639
+ | <code>Teise maailmasõja ajal oli ta mitme Saksa juhatusele alluvate eesti väeosa ülem.</code> | <code>During World War II, he was the commander of several of the German leadership.</code> | <code>0.4516666829586029</code> |
640
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
641
+ ```json
642
+ {
643
+ "scale": 20.0,
644
+ "similarity_fct": "pairwise_cos_sim"
645
+ }
646
+ ```
647
+
648
+ #### mlqe_ne_en
649
+
650
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
651
+ * Size: 1,000 evaluation samples
652
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
653
+ * Approximate statistics based on the first 1000 samples:
654
+ | | sentence1 | sentence2 | score |
655
+ |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------|
656
+ | type | string | string | float |
657
+ | details | <ul><li>min: 17 tokens</li><li>mean: 41.03 tokens</li><li>max: 85 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.77 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.05</li><li>mean: 0.36</li><li>max: 0.92</li></ul> |
658
+ * Samples:
659
+ | sentence1 | sentence2 | score |
660
+ |:------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------|
661
+ | <code>१८९२ तिर भवानीदत्त पाण्डेले 'मुद्रा राक्षस'को अनुवाद गरे।</code> | <code>Around 1892, Bhavani Pandit translated the 'money monster'.</code> | <code>0.8416666388511658</code> |
662
+ | <code>यस बच्चाको मुखले आमाको स्तन यस बच्चाको मुखले आमाको स्तन राम्ररी च्यापेको छ ।</code> | <code>The breasts of this child's mouth are taped well with the mother's mouth.</code> | <code>0.2150000035762787</code> |
663
+ | <code>बुवाको बन्दुक चोरेर हिँडेका बराललाई केआई सिंहले अब गोली ल्याउन लगाए ।...</code> | <code>Kei Singh, who stole the boy's closet, took the bullet to bring it now..</code> | <code>0.27000001072883606</code> |
664
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
665
+ ```json
666
+ {
667
+ "scale": 20.0,
668
+ "similarity_fct": "pairwise_cos_sim"
669
+ }
670
+ ```
671
+
672
+ #### mlqe_ro_en
673
+
674
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
675
+ * Size: 1,000 evaluation samples
676
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
677
+ * Approximate statistics based on the first 1000 samples:
678
+ | | sentence1 | sentence2 | score |
679
+ |:--------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------|
680
+ | type | string | string | float |
681
+ | details | <ul><li>min: 14 tokens</li><li>mean: 30.25 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 22.7 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
682
+ * Samples:
683
+ | sentence1 | sentence2 | score |
684
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
685
+ | <code>Cornwallis se afla înconjurat pe uscat de forțe armate net superioare și retragerea pe mare era îndoielnică din cauza flotei franceze.</code> | <code>Cornwallis was surrounded by shore by higher armed forces and the sea withdrawal was doubtful due to the French fleet.</code> | <code>0.8199999928474426</code> |
686
+ | <code>thumbrightuprightDansatori [[cretani de muzică tradițională.</code> | <code>Number of employees employed in the production of the like product in the Union.</code> | <code>0.009999999776482582</code> |
687
+ | <code>Potrivit documentelor vremii și tradiției orale, aceasta a fost cea mai grea perioadă din istoria orașului.</code> | <code>According to the documents of the oral weather and tradition, this was the hardest period in the city's history.</code> | <code>0.5383332967758179</code> |
688
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
689
+ ```json
690
+ {
691
+ "scale": 20.0,
692
+ "similarity_fct": "pairwise_cos_sim"
693
+ }
694
+ ```
695
+
696
+ #### mlqe_si_en
697
+
698
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
699
+ * Size: 1,000 evaluation samples
700
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
701
+ * Approximate statistics based on the first 1000 samples:
702
+ | | sentence1 | sentence2 | score |
703
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
704
+ | type | string | string | float |
705
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.12 tokens</li><li>max: 36 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 22.18 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.51</li><li>max: 0.99</li></ul> |
706
+ * Samples:
707
+ | sentence1 | sentence2 | score |
708
+ |:----------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|:--------------------------------|
709
+ | <code>එයට ශි්‍ර ලංකාවේ සාමය ඇති කිරිමටත් නැති කිරිමටත් පුළුවන්.</code> | <code>It can also cause peace in Sri Lanka.</code> | <code>0.3199999928474426</code> |
710
+ | <code>ඔහු මනෝ විද්‍යාව, සමාජ විද්‍යාව, ඉතිහාසය හා සන්නිවේදනය යන විෂය ක්ෂේත්‍රයන් පිලිබදවද අධ්‍යයනයන් සිදු කිරීමට උත්සාහ කරන ලදි.</code> | <code>He attempted to do subjects in psychology, sociology, history and communication.</code> | <code>0.5366666913032532</code> |
711
+ | <code>එහෙත් කිසිදු මිනිසෙක්‌ හෝ ගැහැනියෙක්‌ එලිමහනක නොවූහ.</code> | <code>But no man or woman was eliminated.</code> | <code>0.2783333361148834</code> |
712
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
713
+ ```json
714
+ {
715
+ "scale": 20.0,
716
+ "similarity_fct": "pairwise_cos_sim"
717
+ }
718
+ ```
719
+
720
+ ### Training Hyperparameters
721
+ #### Non-Default Hyperparameters
722
+
723
+ - `eval_strategy`: steps
724
+ - `per_device_train_batch_size`: 64
725
+ - `per_device_eval_batch_size`: 64
726
+ - `num_train_epochs`: 2
727
+ - `warmup_ratio`: 0.1
728
+
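+ A sketch of how a comparable run can be set up with these settings. The two training rows are taken from the mlqe_en_de samples above purely as a stand-in; the actual run used the full `wmt_da` and `mlqe_*` splits (optionally passed as a dictionary of datasets) with columns shaped as `sentence1`, `sentence2`, `score`, and evaluated every few thousand steps as logged below:
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import CoSENTLoss
+
+ model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")
+
+ # Tiny in-memory stand-in for the corpora listed above, already shaped as sentence1 / sentence2 / score.
+ train_dataset = Dataset.from_dict({
+     "sentence1": [
+         "Early Muslim traders and merchants visited Bengal while traversing the Silk Road in the first millennium.",
+         "Blaine's nomination alienated many Republicans who viewed Blaine as ambitious and immoral.",
+     ],
+     "sentence2": [
+         "Frühe muslimische Händler und Kaufleute besuchten Bengalen, während sie im ersten Jahrtausend die Seidenstraße durchquerten.",
+         "Blaines Nominierung entfremdete viele Republikaner, die Blaine als ehrgeizig und unmoralisch betrachteten.",
+     ],
+     "score": [0.9233333468437195, 0.9216666221618652],
+ })
+
+ loss = CoSENTLoss(model, scale=20.0)  # matches the loss parameters listed above
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="distiluse-base-multilingual-cased-v2-sts",  # hypothetical output path
+     num_train_epochs=2,
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     warmup_ratio=0.1,
+     # eval_strategy="steps" was also set; the eval_dataset dictionary is omitted here for brevity.
+ )
+
+ trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
+ trainer.train()
+ ```
+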
729
+ #### All Hyperparameters
730
+ <details><summary>Click to expand</summary>
731
+
732
+ - `overwrite_output_dir`: False
733
+ - `do_predict`: False
734
+ - `eval_strategy`: steps
735
+ - `prediction_loss_only`: True
736
+ - `per_device_train_batch_size`: 64
737
+ - `per_device_eval_batch_size`: 64
738
+ - `per_gpu_train_batch_size`: None
739
+ - `per_gpu_eval_batch_size`: None
740
+ - `gradient_accumulation_steps`: 1
741
+ - `eval_accumulation_steps`: None
742
+ - `torch_empty_cache_steps`: None
743
+ - `learning_rate`: 5e-05
744
+ - `weight_decay`: 0.0
745
+ - `adam_beta1`: 0.9
746
+ - `adam_beta2`: 0.999
747
+ - `adam_epsilon`: 1e-08
748
+ - `max_grad_norm`: 1.0
749
+ - `num_train_epochs`: 2
750
+ - `max_steps`: -1
751
+ - `lr_scheduler_type`: linear
752
+ - `lr_scheduler_kwargs`: {}
753
+ - `warmup_ratio`: 0.1
754
+ - `warmup_steps`: 0
755
+ - `log_level`: passive
756
+ - `log_level_replica`: warning
757
+ - `log_on_each_node`: True
758
+ - `logging_nan_inf_filter`: True
759
+ - `save_safetensors`: True
760
+ - `save_on_each_node`: False
761
+ - `save_only_model`: False
762
+ - `restore_callback_states_from_checkpoint`: False
763
+ - `no_cuda`: False
764
+ - `use_cpu`: False
765
+ - `use_mps_device`: False
766
+ - `seed`: 42
767
+ - `data_seed`: None
768
+ - `jit_mode_eval`: False
769
+ - `use_ipex`: False
770
+ - `bf16`: False
771
+ - `fp16`: False
772
+ - `fp16_opt_level`: O1
773
+ - `half_precision_backend`: auto
774
+ - `bf16_full_eval`: False
775
+ - `fp16_full_eval`: False
776
+ - `tf32`: None
777
+ - `local_rank`: 0
778
+ - `ddp_backend`: None
779
+ - `tpu_num_cores`: None
780
+ - `tpu_metrics_debug`: False
781
+ - `debug`: []
782
+ - `dataloader_drop_last`: False
783
+ - `dataloader_num_workers`: 0
784
+ - `dataloader_prefetch_factor`: None
785
+ - `past_index`: -1
786
+ - `disable_tqdm`: False
787
+ - `remove_unused_columns`: True
788
+ - `label_names`: None
789
+ - `load_best_model_at_end`: False
790
+ - `ignore_data_skip`: False
791
+ - `fsdp`: []
792
+ - `fsdp_min_num_params`: 0
793
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
794
+ - `fsdp_transformer_layer_cls_to_wrap`: None
795
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
796
+ - `deepspeed`: None
797
+ - `label_smoothing_factor`: 0.0
798
+ - `optim`: adamw_torch
799
+ - `optim_args`: None
800
+ - `adafactor`: False
801
+ - `group_by_length`: False
802
+ - `length_column_name`: length
803
+ - `ddp_find_unused_parameters`: None
804
+ - `ddp_bucket_cap_mb`: None
805
+ - `ddp_broadcast_buffers`: False
806
+ - `dataloader_pin_memory`: True
807
+ - `dataloader_persistent_workers`: False
808
+ - `skip_memory_metrics`: True
809
+ - `use_legacy_prediction_loop`: False
810
+ - `push_to_hub`: False
811
+ - `resume_from_checkpoint`: None
812
+ - `hub_model_id`: None
813
+ - `hub_strategy`: every_save
814
+ - `hub_private_repo`: None
815
+ - `hub_always_push`: False
816
+ - `gradient_checkpointing`: False
817
+ - `gradient_checkpointing_kwargs`: None
818
+ - `include_inputs_for_metrics`: False
819
+ - `include_for_metrics`: []
820
+ - `eval_do_concat_batches`: True
821
+ - `fp16_backend`: auto
822
+ - `push_to_hub_model_id`: None
823
+ - `push_to_hub_organization`: None
824
+ - `mp_parameters`:
825
+ - `auto_find_batch_size`: False
826
+ - `full_determinism`: False
827
+ - `torchdynamo`: None
828
+ - `ray_scope`: last
829
+ - `ddp_timeout`: 1800
830
+ - `torch_compile`: False
831
+ - `torch_compile_backend`: None
832
+ - `torch_compile_mode`: None
833
+ - `dispatch_batches`: None
834
+ - `split_batches`: None
835
+ - `include_tokens_per_second`: False
836
+ - `include_num_input_tokens_seen`: False
837
+ - `neftune_noise_alpha`: None
838
+ - `optim_target_modules`: None
839
+ - `batch_eval_metrics`: False
840
+ - `eval_on_start`: False
841
+ - `use_liger_kernel`: False
842
+ - `eval_use_gather_object`: False
843
+ - `average_tokens_across_devices`: False
844
+ - `prompts`: None
845
+ - `batch_sampler`: batch_sampler
846
+ - `multi_dataset_batch_sampler`: proportional
847
+
848
+ </details>
849
+
850
+ ### Training Logs
851
+ | Epoch | Step | Training Loss | wmt da loss | mlqe en de loss | mlqe en zh loss | mlqe et en loss | mlqe ne en loss | mlqe ro en loss | mlqe si en loss | sts-eval_spearman_cosine | sts-test_spearman_cosine |
852
+ |:-----:|:-----:|:-------------:|:-----------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:------------------------:|:------------------------:|
853
+ | 0.4 | 6690 | 7.8421 | 7.5547 | 7.5619 | 7.5555 | 7.5327 | 7.5354 | 7.5109 | 7.5564 | 0.1989 | - |
854
+ | 0.8 | 13380 | 7.552 | 7.5420 | 7.5757 | 7.5739 | 7.5185 | 7.5126 | 7.4994 | 7.5511 | 0.2336 | - |
855
+ | 1.2 | 20070 | 7.5216 | 7.5465 | 7.6072 | 7.5942 | 7.5217 | 7.5141 | 7.4871 | 7.5471 | 0.2694 | - |
856
+ | 1.6 | 26760 | 7.5024 | 7.5329 | 7.6123 | 7.5814 | 7.5230 | 7.5141 | 7.4679 | 7.5379 | 0.2866 | - |
857
+ | 2.0 | 33450 | 7.495 | 7.5252 | 7.6106 | 7.5756 | 7.5201 | 7.5128 | 7.4725 | 7.5417 | 0.2814 | 0.2807 |
858
+
859
+
860
+ ### Framework Versions
861
+ - Python: 3.11.10
862
+ - Sentence Transformers: 3.3.1
863
+ - Transformers: 4.47.1
864
+ - PyTorch: 2.3.1+cu121
865
+ - Accelerate: 1.2.1
866
+ - Datasets: 3.2.0
867
+ - Tokenizers: 0.21.0
868
+
869
+ ## Citation
870
+
871
+ ### BibTeX
872
+
873
+ #### Sentence Transformers
874
+ ```bibtex
875
+ @inproceedings{reimers-2019-sentence-bert,
876
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
877
+ author = "Reimers, Nils and Gurevych, Iryna",
878
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
879
+ month = "11",
880
+ year = "2019",
881
+ publisher = "Association for Computational Linguistics",
882
+ url = "https://arxiv.org/abs/1908.10084",
883
+ }
884
+ ```
885
+
886
+ #### CoSENTLoss
887
+ ```bibtex
888
+ @online{kexuefm-8847,
889
+ title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
890
+ author={Su Jianlin},
891
+ year={2022},
892
+ month={Jan},
893
+ url={https://kexue.fm/archives/8847},
894
+ }
895
+ ```
896
+
897
+ <!--
898
+ ## Glossary
899
+
900
+ *Clearly define terms in order to be accessible across audiences.*
901
+ -->
902
+
903
+ <!--
904
+ ## Model Card Authors
905
+
906
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
907
+ -->
908
+
909
+ <!--
910
+ ## Model Card Contact
911
+
912
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
913
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v2",
3
+ "activation": "gelu",
4
+ "architectures": [
5
+ "DistilBertModel"
6
+ ],
7
+ "attention_dropout": 0.1,
8
+ "dim": 768,
9
+ "dropout": 0.1,
10
+ "hidden_dim": 3072,
11
+ "initializer_range": 0.02,
12
+ "max_position_embeddings": 512,
13
+ "model_type": "distilbert",
14
+ "n_heads": 12,
15
+ "n_layers": 6,
16
+ "output_hidden_states": true,
17
+ "output_past": true,
18
+ "pad_token_id": 0,
19
+ "qa_dropout": 0.1,
20
+ "seq_classif_dropout": 0.2,
21
+ "sinusoidal_pos_embds": false,
22
+ "tie_weights_": true,
23
+ "torch_dtype": "float32",
24
+ "transformers_version": "4.47.1",
25
+ "vocab_size": 119547
26
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.3.1",
4
+ "transformers": "4.47.1",
5
+ "pytorch": "2.3.1+cu121"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "cosine"
10
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dbbc5f2d5511cb456059e2355c006214e670f7b2d9b3b879412e673e4aeab832
3
+ size 538947416
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Dense",
18
+ "type": "sentence_transformers.models.Dense"
19
+ }
20
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 128,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,60 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": false,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "extra_special_tokens": {},
49
+ "full_tokenizer_file": null,
50
+ "mask_token": "[MASK]",
51
+ "max_len": 512,
52
+ "model_max_length": 128,
53
+ "never_split": null,
54
+ "pad_token": "[PAD]",
55
+ "sep_token": "[SEP]",
56
+ "strip_accents": null,
57
+ "tokenize_chinese_chars": true,
58
+ "tokenizer_class": "DistilBertTokenizer",
59
+ "unk_token": "[UNK]"
60
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff