deman539 committed · Commit 41a21be · verified · Parent: aa5048f

Add new SentenceTransformer model.
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "word_embedding_dimension": 768,
+     "pooling_mode_cls_token": false,
+     "pooling_mode_mean_tokens": true,
+     "pooling_mode_max_tokens": false,
+     "pooling_mode_mean_sqrt_len_tokens": false,
+     "pooling_mode_weightedmean_tokens": false,
+     "pooling_mode_lasttoken": false,
+     "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,695 @@
+ ---
+ base_model: nomic-ai/nomic-embed-text-v1
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ - dot_accuracy@1
+ - dot_accuracy@3
+ - dot_accuracy@5
+ - dot_accuracy@10
+ - dot_precision@1
+ - dot_precision@3
+ - dot_precision@5
+ - dot_precision@10
+ - dot_recall@1
+ - dot_recall@3
+ - dot_recall@5
+ - dot_recall@10
+ - dot_ndcg@10
+ - dot_mrr@10
+ - dot_map@100
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:2459
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ widget:
+ - source_sentence: What types of applications may require confidentiality during their
+     launch?
+   sentences:
+   - "Taken together, the technical protections and practices laid out in the Blueprint\
+     \ for an AI Bill of Rights can help \nguard the American public against many of\
+     \ the potential and actual harms identified by researchers, technolo­\ngists,\
+     \ advocates, journalists, policymakers, and communities in the United States and\
+     \ around the world. This \ntechnical companion is intended to be used as a reference\
+     \ by people across many circumstances – anyone"
+   - "deactivate AI systems that demonstrate performance or outcomes inconsistent with\
+     \ intended use. \nAction ID \nSuggested Action \nGAI Risks \nMG-2.4-001 \nEstablish\
+     \ and maintain communication plans to inform AI stakeholders as part of \nthe\
+     \ deactivation or disengagement process of a specific GAI system (including for\
+     \ \nopen-source models) or context of use, including reasons, workarounds, user\
+     \ \naccess removal, alternative processes, contact information, etc. \nHuman-AI\
+     \ Configuration"
+   - "launch may need to be confidential. Government applications, particularly law\
+     \ enforcement applications or \napplications that raise national security considerations,\
+     \ may require confidential or limited engagement based \non system sensitivities\
+     \ and preexisting oversight laws and structures. Concerns raised in this consultation\
+     \ \nshould be documented, and the automated system developers were proposing to\
+     \ create, use, or deploy should \nbe reconsidered based on this feedback."
+ - source_sentence: What is the main focus of the paper by Chandra et al. (2023) regarding
+     Chinese influence operations?
+   sentences:
+   - "https://arxiv.org/abs/2403.06634 \nChandra, B. et al. (2023) Dismantling the\
+     \ Disinformation Business of Chinese Influence Operations. \nRAND. https://www.rand.org/pubs/commentary/2023/10/dismantling-the-disinformation-business-of-\n\
+     chinese.html \nCiriello, R. et al. (2024) Ethical Tensions in Human-AI Companionship:\
+     \ A Dialectical Inquiry into Replika. \nResearchGate. https://www.researchgate.net/publication/374505266_Ethical_Tensions_in_Human-\n\
+     AI_Companionship_A_Dialectical_Inquiry_into_Replika"
+   - "monocultures,3” resulting from repeated use of the same model, or impacts on\
+     \ access to \nopportunity, labor markets, and the creative economies.4 \n• \n\
+     Source of risk: Risks may emerge from factors related to the design, training,\
+     \ or operation of the \nGAI model itself, stemming in some cases from GAI model\
+     \ or system inputs, and in other cases, \nfrom GAI system outputs. Many GAI risks,\
+     \ however, originate from human behavior, including"
+   - "limited to GAI model or system architecture, training mechanisms and libraries,\
+     \ data types used for \ntraining or fine-tuning, levels of model access or availability\
+     \ of model weights, and application or use \ncase context. \nOrganizations may\
+     \ choose to tailor how they measure GAI risks based on these characteristics.\
+     \ They may \nadditionally wish to allocate risk management resources relative\
+     \ to the severity and likelihood of"
+ - source_sentence: What steps are being taken to enhance transparency and accountability
+     in the GAI system?
+   sentences:
+   - "security, health, foreign relations, the environment, and the technological recovery\
+     \ and use of resources, among \nother topics. OSTP leads interagency science and\
+     \ technology policy coordination efforts, assists the Office of \nManagement and\
+     \ Budget (OMB) with an annual review and analysis of Federal research and development\
+     \ in \nbudgets, and serves as a source of scientific and technological analysis\
+     \ and judgment for the President with"
+   - "steps taken to update the GAI system to enhance transparency and \naccountability.\
+     \ \nHuman-AI Configuration; Harmful \nBias and Homogenization \nMG-4.1-006 \nTrack\
+     \ dataset modifications for provenance by monitoring data deletions, \nrectification\
+     \ requests, and other changes that may impact the verifiability of \ncontent origins.\
+     \ \nInformation Integrity"
+   - "content. Some well-known techniques for provenance data tracking include digital\
+     \ watermarking, \nmetadata recording, digital fingerprinting, and human authentication,\
+     \ among others. \nProvenance Data Tracking Approaches \nProvenance data tracking\
+     \ techniques for GAI systems can be used to track the history and origin of data\
+     \ \ninputs, metadata, and synthetic content. Provenance data tracking records\
+     \ the origin and history for"
+ - source_sentence: What are some examples of mechanisms for human consideration and
+     fallback mentioned in the context?
+   sentences:
+   - "consequences resulting from the utilization of content provenance approaches\
+     \ on users and \ncommunities. Furthermore, organizations can track and document\
+     \ the provenance of datasets to identify \ninstances in which AI-generated data\
+     \ is a potential root cause of performance issues with the GAI \nsystem. \nA.1.8.\
+     \ Incident Disclosure \nOverview \nAI incidents can be defined as an “event, circumstance,\
+     \ or series of events where the development, use,"
+   - "fully impact rights, opportunities, or access. Automated systems that have greater\
+     \ control over outcomes, \nprovide input to high-stakes decisions, relate to sensitive\
+     \ domains, or otherwise have a greater potential to \nmeaningfully impact rights,\
+     \ opportunities, or access should have greater availability (e.g., staffing) and\
+     \ over­\nsight of human consideration and fallback mechanisms. \nAccessible. Mechanisms\
+     \ for human consideration and fallback, whether in-person, on paper, by phone,\
+     \ or"
+   - '•
+
+     Frida Polli, CEO, Pymetrics
+
+
+
+     Karen Levy, Assistant Professor, Department of Information Science, Cornell University
+
+
+
+     Natasha Duarte, Project Director, Upturn
+
+
+
+     Elana Zeide, Assistant Professor, University of Nebraska College of Law
+
+
+
+     Fabian Rogers, Constituent Advocate, Office of NY State Senator Jabari Brisport
+     and Community
+
+     Advocate and Floor Captain, Atlantic Plaza Towers Tenants Association'
+ - source_sentence: What mental health issues are associated with the increased use
+     of technologies in schools and workplaces?
+   sentences:
+   - "but this approach may still produce harmful recommendations in response to other\
+     \ less-explicit, novel \nprompts (also relevant to CBRN Information or Capabilities,\
+     \ Data Privacy, Information Security, and \nObscene, Degrading and/or Abusive\
+     \ Content). Crafting such prompts deliberately is known as \n“jailbreaking,” or,\
+     \ manipulating prompts to circumvent output controls. Limitations of GAI systems\
+     \ can be"
+   - "external use, narrow vs. broad application scope, fine-tuning, and varieties of\
+     \ \ndata sources (e.g., grounding, retrieval-augmented generation). \nData Privacy;\
+     \ Intellectual \nProperty"
+   - "technologies has increased in schools and workplaces, and, when coupled with\
+     \ consequential management and \nevaluation decisions, it is leading to mental\
+     \ health harms such as lowered self-confidence, anxiety, depression, and \na reduced\
+     \ ability to use analytical reasoning.61 Documented patterns show that personal\
+     \ data is being aggregated by \ndata brokers to profile communities in harmful\
+     \ ways.62 The impact of all this data harvesting is corrosive,"
+ model-index:
+ - name: SentenceTransformer based on nomic-ai/nomic-embed-text-v1
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.8584142394822006
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.9838187702265372
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.9951456310679612
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9991909385113269
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.8584142394822006
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.32793959007551243
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1990291262135922
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09991909385113268
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.8584142394822006
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.9838187702265372
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.9951456310679612
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9991909385113269
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.9417951214306157
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.9220443571171728
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.9221065926163013
+       name: Cosine Map@100
+     - type: dot_accuracy@1
+       value: 0.8584142394822006
+       name: Dot Accuracy@1
+     - type: dot_accuracy@3
+       value: 0.9838187702265372
+       name: Dot Accuracy@3
+     - type: dot_accuracy@5
+       value: 0.9951456310679612
+       name: Dot Accuracy@5
+     - type: dot_accuracy@10
+       value: 0.9991909385113269
+       name: Dot Accuracy@10
+     - type: dot_precision@1
+       value: 0.8584142394822006
+       name: Dot Precision@1
+     - type: dot_precision@3
+       value: 0.32793959007551243
+       name: Dot Precision@3
+     - type: dot_precision@5
+       value: 0.1990291262135922
+       name: Dot Precision@5
+     - type: dot_precision@10
+       value: 0.09991909385113268
+       name: Dot Precision@10
+     - type: dot_recall@1
+       value: 0.8584142394822006
+       name: Dot Recall@1
+     - type: dot_recall@3
+       value: 0.9838187702265372
+       name: Dot Recall@3
+     - type: dot_recall@5
+       value: 0.9951456310679612
+       name: Dot Recall@5
+     - type: dot_recall@10
+       value: 0.9991909385113269
+       name: Dot Recall@10
+     - type: dot_ndcg@10
+       value: 0.9417951214306157
+       name: Dot Ndcg@10
+     - type: dot_mrr@10
+       value: 0.9220443571171728
+       name: Dot Mrr@10
+     - type: dot_map@100
+       value: 0.9221065926163013
+       name: Dot Map@100
+ ---
+
+ # SentenceTransformer based on nomic-ai/nomic-embed-text-v1
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) <!-- at revision cc62377b015c53a3bf52bb2f4eb8c55326d3f162 -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
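+ For readers who want to see what the Pooling and Normalize modules above actually compute, here is a minimal PyTorch sketch of masked mean pooling followed by L2 normalization. It is an illustration, not code from this repository; `token_embeddings` and `attention_mask` stand in for the Transformer module's outputs.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def mean_pool_and_normalize(token_embeddings: torch.Tensor,
+                             attention_mask: torch.Tensor) -> torch.Tensor:
+     # Zero out padding positions, then average over the sequence dimension
+     # (this is what pooling_mode_mean_tokens=True configures).
+     mask = attention_mask.unsqueeze(-1).float()    # [batch, seq, 1]
+     summed = (token_embeddings * mask).sum(dim=1)  # [batch, 768]
+     counts = mask.sum(dim=1).clamp(min=1e-9)       # [batch, 1]
+     embeddings = summed / counts
+     # The final Normalize module L2-normalizes each vector, which is why the
+     # cosine and dot-product metrics reported below are identical.
+     return F.normalize(embeddings, p=2, dim=1)
+ ```
+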
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)  # the custom NomicBert architecture loads remote code
+ # Run inference
+ sentences = [
+     'What mental health issues are associated with the increased use of technologies in schools and workplaces?',
+     'technologies has increased in schools and workplaces, and, when coupled with consequential management and \nevaluation decisions, it is leading to mental health harms such as lowered self-confidence, anxiety, depression, and \na reduced ability to use analytical reasoning.61 Documented patterns show that personal data is being aggregated by \ndata brokers to profile communities in harmful ways.62 The impact of all this data harvesting is corrosive,',
+     'but this approach may still produce harmful recommendations in response to other less-explicit, novel \nprompts (also relevant to CBRN Information or Capabilities, Data Privacy, Information Security, and \nObscene, Degrading and/or Abusive Content). Crafting such prompts deliberately is known as \n“jailbreaking,” or, manipulating prompts to circumvent output controls. Limitations of GAI systems can be',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
+
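+ Because the model was trained with a Matryoshka objective over dimensions 768/512/256/128/64 (see Training Details below), embeddings can also be truncated to a shorter prefix with limited quality loss. A sketch, assuming a sentence-transformers release (v2.7+) that supports the `truncate_dim` argument:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Keep only the first 256 dimensions of every embedding.
+ model_256 = SentenceTransformer(
+     "deman539/nomic-embed-text-v1",
+     trust_remote_code=True,
+     truncate_dim=256,
+ )
+ embeddings = model_256.encode(["an example sentence"])
+ print(embeddings.shape)
+ # (1, 256)
+ ```
+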
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.8584     |
+ | cosine_accuracy@3   | 0.9838     |
+ | cosine_accuracy@5   | 0.9951     |
+ | cosine_accuracy@10  | 0.9992     |
+ | cosine_precision@1  | 0.8584     |
+ | cosine_precision@3  | 0.3279     |
+ | cosine_precision@5  | 0.199      |
+ | cosine_precision@10 | 0.0999     |
+ | cosine_recall@1     | 0.8584     |
+ | cosine_recall@3     | 0.9838     |
+ | cosine_recall@5     | 0.9951     |
+ | cosine_recall@10    | 0.9992     |
+ | cosine_ndcg@10      | 0.9418     |
+ | cosine_mrr@10       | 0.922      |
+ | **cosine_map@100**  | **0.9221** |
+ | dot_accuracy@1      | 0.8584     |
+ | dot_accuracy@3      | 0.9838     |
+ | dot_accuracy@5      | 0.9951     |
+ | dot_accuracy@10     | 0.9992     |
+ | dot_precision@1     | 0.8584     |
+ | dot_precision@3     | 0.3279     |
+ | dot_precision@5     | 0.199      |
+ | dot_precision@10    | 0.0999     |
+ | dot_recall@1        | 0.8584     |
+ | dot_recall@3        | 0.9838     |
+ | dot_recall@5        | 0.9951     |
+ | dot_recall@10       | 0.9992     |
+ | dot_ndcg@10         | 0.9418     |
+ | dot_mrr@10          | 0.922      |
+ | dot_map@100         | 0.9221     |
+
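+ As a rough sketch of how numbers like these are produced: the evaluator takes queries, a corpus, and gold relevance judgments, embeds everything, and scores the ranked retrieval results. The tiny dataset below is a hypothetical placeholder; the actual evaluation data is not included in this repository.
+
+ ```python
+ from sentence_transformers.evaluation import InformationRetrievalEvaluator
+
+ # Placeholder id -> text mappings and relevance judgments.
+ queries = {"q1": "What does mean pooling do?"}
+ corpus = {"d1": "Mean pooling averages token embeddings.",
+           "d2": "An unrelated passage."}
+ relevant_docs = {"q1": {"d1"}}
+
+ evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
+ results = evaluator(model)  # dict of accuracy@k, precision@k, recall@k, NDCG, MRR, MAP
+ ```
+
+ Here `model` is the SentenceTransformer loaded in the Usage section above.
+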
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 2,459 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0                                                                        | sentence_1                                                                           |
+   |:--------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
+   | type    | string                                                                            | string                                                                               |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 18.7 tokens</li><li>max: 35 tokens</li></ul> | <ul><li>min: 22 tokens</li><li>mean: 93.19 tokens</li><li>max: 337 tokens</li></ul> |
+ * Samples:
+   | sentence_0                                                                                                        | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+   |:------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   | <code>What should organizations include in contracts to evaluate third-party GAI processes and standards?</code> | <code>services acquisition and value chain risk management; and legal compliance. <br>Data Privacy; Information <br>Integrity; Information Security; <br>Intellectual Property; Value Chain <br>and Component Integration <br>GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party <br>GAI processes and standards. <br>Information Integrity <br>GV-6.1-007 Inventory all third-party entities with access to organizational content and <br>establish approved GAI technology and service provider lists.</code> |
+   | <code>What steps should be taken to manage third-party entities with access to organizational content?</code>    | <code>services acquisition and value chain risk management; and legal compliance. <br>Data Privacy; Information <br>Integrity; Information Security; <br>Intellectual Property; Value Chain <br>and Component Integration <br>GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party <br>GAI processes and standards. <br>Information Integrity <br>GV-6.1-007 Inventory all third-party entities with access to organizational content and <br>establish approved GAI technology and service provider lists.</code> |
+   | <code>What should entities responsible for automated systems establish before deploying the system?</code>       | <code>Clear organizational oversight. Entities responsible for the development or use of automated systems <br>should lay out clear governance structures and procedures. This includes clearly-stated governance proce­<br>dures before deploying the system, as well as responsibility of specific individuals or entities to oversee ongoing <br>assessment and mitigation. Organizational stakeholders including those with oversight of the business process</code>                                                                                   |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
+
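+ A minimal sketch of how this wrapped loss is typically constructed with the sentence-transformers losses API (an illustration under the parameters above, not the author's training script):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
+
+ # In-batch negatives ranking loss, additionally applied to the truncated
+ # 512/256/128/64-dimensional prefixes of each embedding.
+ inner_loss = MultipleNegativesRankingLoss(model)
+ loss = MatryoshkaLoss(model, inner_loss,
+                       matryoshka_dims=[768, 512, 256, 128, 64])
+ ```
+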
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `num_train_epochs`: 20
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 20
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `eval_use_gather_object`: False
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch   | Step | Training Loss | cosine_map@100 |
+ |:-------:|:----:|:-------------:|:--------------:|
+ | 0.6494  | 50   | -             | 0.8493         |
+ | 1.0     | 77   | -             | 0.8737         |
+ | 1.2987  | 100  | -             | 0.8677         |
+ | 1.9481  | 150  | -             | 0.8859         |
+ | 2.0     | 154  | -             | 0.8886         |
+ | 2.5974  | 200  | -             | 0.8913         |
+ | 3.0     | 231  | -             | 0.9058         |
+ | 3.2468  | 250  | -             | 0.8993         |
+ | 3.8961  | 300  | -             | 0.9077         |
+ | 4.0     | 308  | -             | 0.9097         |
+ | 4.5455  | 350  | -             | 0.9086         |
+ | 5.0     | 385  | -             | 0.9165         |
+ | 5.1948  | 400  | -             | 0.9141         |
+ | 5.8442  | 450  | -             | 0.9132         |
+ | 6.0     | 462  | -             | 0.9138         |
+ | 6.4935  | 500  | 0.3094        | 0.9137         |
+ | 7.0     | 539  | -             | 0.9166         |
+ | 7.1429  | 550  | -             | 0.9172         |
+ | 7.7922  | 600  | -             | 0.9160         |
+ | 8.0     | 616  | -             | 0.9169         |
+ | 8.4416  | 650  | -             | 0.9177         |
+ | 9.0     | 693  | -             | 0.9169         |
+ | 9.0909  | 700  | -             | 0.9177         |
+ | 9.7403  | 750  | -             | 0.9178         |
+ | 10.0    | 770  | -             | 0.9178         |
+ | 10.3896 | 800  | -             | 0.9189         |
+ | 11.0    | 847  | -             | 0.9180         |
+ | 11.0390 | 850  | -             | 0.9180         |
+ | 11.6883 | 900  | -             | 0.9188         |
+ | 12.0    | 924  | -             | 0.9192         |
+ | 12.3377 | 950  | -             | 0.9204         |
+ | 12.9870 | 1000 | 0.0571        | 0.9202         |
+ | 13.0    | 1001 | -             | 0.9201         |
+ | 13.6364 | 1050 | -             | 0.9212         |
+ | 14.0    | 1078 | -             | 0.9203         |
+ | 14.2857 | 1100 | -             | 0.9219         |
+ | 14.9351 | 1150 | -             | 0.9207         |
+ | 15.0    | 1155 | -             | 0.9207         |
+ | 15.5844 | 1200 | -             | 0.9210         |
+ | 16.0    | 1232 | -             | 0.9208         |
+ | 16.2338 | 1250 | -             | 0.9216         |
+ | 16.8831 | 1300 | -             | 0.9209         |
+ | 17.0    | 1309 | -             | 0.9209         |
+ | 17.5325 | 1350 | -             | 0.9216         |
+ | 18.0    | 1386 | -             | 0.9213         |
+ | 18.1818 | 1400 | -             | 0.9221         |
+ | 18.8312 | 1450 | -             | 0.9217         |
+ | 19.0    | 1463 | -             | 0.9217         |
+ | 19.4805 | 1500 | 0.0574        | 0.9225         |
+ | 20.0    | 1540 | -             | 0.9221         |
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.1.1
+ - Transformers: 4.44.2
+ - PyTorch: 2.4.1+cu121
+ - Accelerate: 0.34.2
+ - Datasets: 3.0.0
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+     "_name_or_path": "nomic-ai/nomic-embed-text-v1",
+     "activation_function": "swiglu",
+     "architectures": [
+         "NomicBertModel"
+     ],
+     "attn_pdrop": 0.0,
+     "auto_map": {
+         "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+         "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+         "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
+     },
+     "bos_token_id": null,
+     "causal": false,
+     "dense_seq_output": true,
+     "embd_pdrop": 0.0,
+     "eos_token_id": null,
+     "fused_bias_fc": true,
+     "fused_dropout_add_ln": true,
+     "initializer_range": 0.02,
+     "layer_norm_epsilon": 1e-12,
+     "max_trained_positions": 2048,
+     "mlp_fc1_bias": false,
+     "mlp_fc2_bias": false,
+     "model_type": "nomic_bert",
+     "n_embd": 768,
+     "n_head": 12,
+     "n_inner": 3072,
+     "n_layer": 12,
+     "n_positions": 8192,
+     "pad_vocab_size_multiple": 64,
+     "parallel_block": false,
+     "parallel_block_tied_norm": false,
+     "prenorm": false,
+     "qkv_proj_bias": false,
+     "reorder_and_upcast_attn": false,
+     "resid_pdrop": 0.0,
+     "rotary_emb_base": 1000,
+     "rotary_emb_fraction": 1.0,
+     "rotary_emb_interleaved": false,
+     "rotary_emb_scale_base": null,
+     "rotary_scaling_factor": 2,
+     "scale_attn_by_inverse_layer_idx": false,
+     "scale_attn_weights": true,
+     "summary_activation": null,
+     "summary_first_dropout": 0.1,
+     "summary_proj_to_labels": true,
+     "summary_type": "cls_index",
+     "summary_use_proj": true,
+     "torch_dtype": "float32",
+     "transformers_version": "4.44.2",
+     "type_vocab_size": 2,
+     "use_cache": true,
+     "use_flash_attn": true,
+     "use_rms_norm": false,
+     "use_xentropy": true,
+     "vocab_size": 30528
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "__version__": {
+         "sentence_transformers": "3.1.1",
+         "transformers": "4.44.2",
+         "pytorch": "2.4.1+cu121"
+     },
+     "prompts": {},
+     "default_prompt_name": null,
+     "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a9ce39092416e72850b8942ec2a2178c43fbc35090017a270089cfaf80000fb5
+ size 546938168
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+     {
+         "idx": 0,
+         "name": "0",
+         "path": "",
+         "type": "sentence_transformers.models.Transformer"
+     },
+     {
+         "idx": 1,
+         "name": "1",
+         "path": "1_Pooling",
+         "type": "sentence_transformers.models.Pooling"
+     },
+     {
+         "idx": 2,
+         "name": "2",
+         "path": "2_Normalize",
+         "type": "sentence_transformers.models.Normalize"
+     }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "max_seq_length": 8192,
+     "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+     "cls_token": {
+         "content": "[CLS]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "mask_token": {
+         "content": "[MASK]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "pad_token": {
+         "content": "[PAD]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "sep_token": {
+         "content": "[SEP]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "unk_token": {
+         "content": "[UNK]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+     "added_tokens_decoder": {
+         "0": {
+             "content": "[PAD]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "100": {
+             "content": "[UNK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "101": {
+             "content": "[CLS]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "102": {
+             "content": "[SEP]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "103": {
+             "content": "[MASK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         }
+     },
+     "clean_up_tokenization_spaces": true,
+     "cls_token": "[CLS]",
+     "do_lower_case": true,
+     "mask_token": "[MASK]",
+     "model_max_length": 8192,
+     "pad_token": "[PAD]",
+     "sep_token": "[SEP]",
+     "strip_accents": null,
+     "tokenize_chinese_chars": true,
+     "tokenizer_class": "BertTokenizer",
+     "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff