Add SetFit model
- 1_Pooling/config.json +10 -0
- README.md +253 -0
- config.json +32 -0
- config_sentence_transformers.json +10 -0
- config_setfit.json +4 -0
- model.safetensors +3 -0
- model_head.pkl +3 -0
- modules.json +20 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +57 -0
- vocab.txt +0 -0
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
1 |
+
{
|
2 |
+
"word_embedding_dimension": 768,
|
3 |
+
"pooling_mode_cls_token": true,
|
4 |
+
"pooling_mode_mean_tokens": false,
|
5 |
+
"pooling_mode_max_tokens": false,
|
6 |
+
"pooling_mode_mean_sqrt_len_tokens": false,
|
7 |
+
"pooling_mode_weightedmean_tokens": false,
|
8 |
+
"pooling_mode_lasttoken": false,
|
9 |
+
"include_prompt": true
|
10 |
+
}
|
README.md
ADDED
@@ -0,0 +1,253 @@
1 |
+
---
|
2 |
+
base_model: BAAI/bge-base-en-v1.5
|
3 |
+
library_name: setfit
|
4 |
+
metrics:
|
5 |
+
- accuracy
|
6 |
+
pipeline_tag: text-classification
|
7 |
+
tags:
|
8 |
+
- setfit
|
9 |
+
- sentence-transformers
|
10 |
+
- text-classification
|
11 |
+
- generated_from_setfit_trainer
|
12 |
+
widget:
|
13 |
+
- text: 'Reasoning:
|
14 |
+
|
15 |
+
The answer correctly states that the College of Arts and Letters at Notre Dame
|
16 |
+
was created in 1842, which is directly supported by the document. The document
|
17 |
+
specifies that the College of Arts and Letters was established in 1842 and is
|
18 |
+
relevant and directly addresses the question without including unnecessary information.
|
19 |
+
|
20 |
+
|
21 |
+
Evaluation:'
|
22 |
+
- text: 'Reasoning:
|
23 |
+
|
24 |
+
The provided answer states, "The average student at Notre Dame travels more than
|
25 |
+
750 miles to study there," which directly addresses the question asked. The document
|
26 |
+
confirms the accuracy of this information with the statement, "the average student
|
27 |
+
traveled more than 750 miles to Notre Dame." The answer is well-grounded in the
|
28 |
+
document, relevant to the specific question, and concise without extraneous information.
|
29 |
+
|
30 |
+
|
31 |
+
Evaluation:'
|
32 |
+
- text: 'Reasoning:
|
33 |
+
|
34 |
+
The provided answer correctly identifies Mick LaSalle as the writer for the San
|
35 |
+
Francisco Chronicle who awarded "Spectre" with a perfect score. This is directly
|
36 |
+
supported by the document, which states, "Other positive reviews from Mick LaSalle
|
37 |
+
from the San Francisco Chronicle,gave it a perfect 100 score..."
|
38 |
+
|
39 |
+
|
40 |
+
Evaluation:'
|
41 |
+
- text: 'Reasoning:
|
42 |
+
|
43 |
+
The given answer states that "The Review of Politics was inspired by German Catholic
|
44 |
+
journals and predominantly featured articles written by Karl Marx." While it correctly
|
45 |
+
identifies that the Review of Politics was inspired by German Catholic journals,
|
46 |
+
the claim that it predominantly featured articles written by Karl Marx is incorrect
|
47 |
+
and not supported by the provided document. The document makes no mention of Karl
|
48 |
+
Marx or indicates that his work was featured in the Review. Instead, it lists
|
49 |
+
other intellectual leaders like Gurian, Jacques Maritain, and Leo Richard Ward.
|
50 |
+
|
51 |
+
|
52 |
+
Evaluation:'
|
53 |
+
- text: 'Reasoning:
|
54 |
+
|
55 |
+
The provided document states that Forbes.com ranked Notre Dame 8th among research
|
56 |
+
universities in the United States. The answer given precisely matches this detail
|
57 |
+
from the document. It accurately addresses the specific question asked, without
|
58 |
+
deviating into unrelated topics or providing unnecessary information.
|
59 |
+
|
60 |
+
|
61 |
+
Evaluation:'
|
62 |
+
inference: true
|
63 |
+
model-index:
|
64 |
+
- name: SetFit with BAAI/bge-base-en-v1.5
|
65 |
+
results:
|
66 |
+
- task:
|
67 |
+
type: text-classification
|
68 |
+
name: Text Classification
|
69 |
+
dataset:
|
70 |
+
name: Unknown
|
71 |
+
type: unknown
|
72 |
+
split: test
|
73 |
+
metrics:
|
74 |
+
- type: accuracy
|
75 |
+
value: 0.9491525423728814
|
76 |
+
name: Accuracy
|
77 |
+
---
|
78 |
+
|
79 |
+
# SetFit with BAAI/bge-base-en-v1.5
|
80 |
+
|
81 |
+
This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
|
82 |
+
|
83 |
+
The model has been trained using an efficient few-shot learning technique that involves:
|
84 |
+
|
85 |
+
1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
|
86 |
+
2. Training a classification head with features from the fine-tuned Sentence Transformer.
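Step 1 above trains on sentence pairs rather than on single labelled examples. The following is a minimal sketch of how such pairs could be built from a small labelled set, assuming the common scheme in which same-label pairs get a target similarity of 1.0 and different-label pairs get 0.0. The helper name `make_contrastive_pairs` and the toy data are hypothetical, and the actual SetFit sampler may differ in detail.

```python
import random

def make_contrastive_pairs(texts, labels, num_iterations, seed=42):
    """Return (text_a, text_b, target) triples for contrastive fine-tuning.

    Hypothetical helper: for every example and every iteration it draws one
    same-label partner (target 1.0) and one different-label partner (target 0.0).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            same = [j for j, other in enumerate(labels) if other == label and j != i]
            diff = [j for j, other in enumerate(labels) if other != label]
            if same:
                pairs.append((text, texts[rng.choice(same)], 1.0))
            if diff:
                pairs.append((text, texts[rng.choice(diff)], 0.0))
    return pairs

# Toy data, not taken from the training set.
texts = ["answer is grounded", "answer matches the document",
         "answer is fabricated", "answer contradicts the document"]
labels = [1, 1, 0, 0]
pairs = make_contrastive_pairs(texts, labels, num_iterations=2)
```

Embeddings from the body fine-tuned on such pairs would then serve as the input features for the classification head in step 2.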
|
87 |
+
|
88 |
+
## Model Details
|
89 |
+
|
90 |
+
### Model Description
|
91 |
+
- **Model Type:** SetFit
|
92 |
+
- **Sentence Transformer body:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
|
93 |
+
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
|
94 |
+
- **Maximum Sequence Length:** 512 tokens
|
95 |
+
- **Number of Classes:** 2 classes
|
96 |
+
<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
|
97 |
+
<!-- - **Language:** Unknown -->
|
98 |
+
<!-- - **License:** Unknown -->
|
99 |
+
|
100 |
+
### Model Sources
|
101 |
+
|
102 |
+
- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
|
103 |
+
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
|
104 |
+
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
|
105 |
+
|
106 |
+
### Model Labels
|
107 |
+
| Label | Examples |
|
108 |
+
|:------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
109 |
+
| 1 | <ul><li>'Reasoning:\nThe answer correctly identifies Joan Gaspart as the individual who resigned from the presidency of Barcelona after the team\'s poor showing in the 2003 season. This is directly supported by the document, which explicitly states that "club president Joan Gaspart resigned, his position having been made completely untenable by such a disastrous season on top of the club\'s overall decline in fortunes since he became president three years prior." The answer is concise and directly relevant to the question without including any extraneous information.\n\nEvaluation:'</li><li>"Reasoning:\nThe provided answer directly addresses the question of why it is recommended to hire a professional residential electrician like O'Hara Electric for electrical work in your house. The answer highlights key points such as the hazards of working with electricity, the potential for injury, and the long-term implications of improperly done electrical work. It also mentions the risk involved even in seemingly simple tasks like smoke detector installation and emphasizes the benefits of having the job done correctly the first time by a professional. The details arewell-supported by the document.\n\nEvaluation:"</li><li>'Reasoning:\nThe answer "The title of Aerosmith\'s 1987 comeback album was \'Permanent Vacation\'" is directly supported by the provided document. The document explicitly states, "Aerosmith\'s comeback album Permanent Vacation (1987) would begin a decade long revival of their popularity." The answer is directly related to the question asked and does not deviate into unrelated topics, ensuring conciseness and relevance.\n\nEvaluation:'</li></ul> |
|
110 |
+
| 0 | <ul><li>'Reasoning:\nThe answer provides a well-supported response that aligns directly with the content presented in the document. It addresses various strategies to combat smoking cravings, such as identifying and avoiding triggers, using distractions, and engaging in alternative activities. Specific triggers, like daily routines and social situations, are described in both the answer and the document. Additionally, the advice on using chewing licorice root and engaging in smoke-free activities is related to the suggestions given in the document. The answer is clear, concise, and stays relevant to the question throughout.\n\nFinal Evaluation: \nEvaluation:'</li><li>"Reasoning:\nThe provided answer accurately captures the challenges Amy Bloom faces when starting a significant writing project, as detailed in the document. Notably, it mentions the difficulty of getting started, the need to clear mental space, and to recalibrate her daily life, which are all points grounded in the text. The answer also covers her becoming less involved in everyday life and spending less time on domestic concerns, which aligns well with the provided passage. However, the part about traveling to a remote island with no internet access is not mentioned in the document and appears to be fabricated, which detracts from the answer's context grounding.\n\nFinal Result:"</li><li>'Reasoning:\nThe provided answer incorrectly states the price and location of the 6 bedroom detached house. According to the document, the 6 bedroom detached house is for sale at a price of £950,000 and is located at Willow Drive, Twyford, Reading, Berkshire, RG10. The answer gives a different priceand an incorrect location.\n\nFinal Evaluation:'</li></ul> |
|
111 |
+
|
112 |
+
## Evaluation
|
113 |
+
|
114 |
+
### Metrics
|
115 |
+
| Label | Accuracy |
|:--------|:---------|
| **all** | 0.9492 |
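The reported accuracy has a simple rational form. As a worked check, the unrounded value in the metadata, 0.9491525423728814, matches 56/59, which would be consistent with 56 correct predictions on a 59-example test split. The split size is not stated in the card, so the counts below are an inference rather than a reported fact.

```python
from fractions import Fraction

reported = 0.9491525423728814  # unrounded accuracy from the model-index metadata

# Find the closest fraction with a small denominator.
ratio = Fraction(reported).limit_denominator(1000)
correct, total = ratio.numerator, ratio.denominator  # inferred, not stated in the card

print(f"{correct}/{total} = {correct / total:.4f}")
```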
|
118 |
+
|
119 |
+
## Uses
|
120 |
+
|
121 |
+
### Direct Use for Inference
|
122 |
+
|
123 |
+
First install the SetFit library:
|
124 |
+
|
125 |
+
```bash
|
126 |
+
pip install setfit
|
127 |
+
```
|
128 |
+
|
129 |
+
Then you can load this model and run inference.
|
130 |
+
|
131 |
+
```python
from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("Netta1994/setfit_baai_squad_gpt-4o_improved-cot-instructions_chat_few_shot_generated_remove_fin")

# Run inference. The input spans several lines, so it is written as a triple-quoted string.
text = """Reasoning:
The provided answer correctly identifies Mick LaSalle as the writer for the San Francisco Chronicle who awarded "Spectre" with a perfect score. This is directly supported by the document, which states, "Other positive reviews from Mick LaSalle from the San Francisco Chronicle,gave it a perfect 100 score..."

Evaluation:"""
preds = model(text)
```
|
142 |
+
|
143 |
+
<!--
|
144 |
+
### Downstream Use
|
145 |
+
|
146 |
+
*List how someone could finetune this model on their own dataset.*
|
147 |
+
-->
|
148 |
+
|
149 |
+
<!--
|
150 |
+
### Out-of-Scope Use
|
151 |
+
|
152 |
+
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
153 |
+
-->
|
154 |
+
|
155 |
+
<!--
|
156 |
+
## Bias, Risks and Limitations
|
157 |
+
|
158 |
+
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
|
159 |
+
-->
|
160 |
+
|
161 |
+
<!--
|
162 |
+
### Recommendations
|
163 |
+
|
164 |
+
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
|
165 |
+
-->
|
166 |
+
|
167 |
+
## Training Details
|
168 |
+
|
169 |
+
### Training Set Metrics
|
170 |
+
| Training set | Min | Median | Max |
|:-------------|:----|:--------|:----|
| Word count | 33 | 76.9045 | 176 |

| Label | Training Sample Count |
|:------|:----------------------|
| 0 | 95 |
| 1 | 104 |
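The value in the Median column deserves a quick sanity check. With 199 training examples (an odd count) and integer word counts, a true median would be a whole number, yet the table lists 76.9045. The sketch below shows that the figure is consistent with a mean over 199 examples instead. This is an inference from the numbers in the tables, not something the card states.

```python
label_counts = {0: 95, 1: 104}           # from the training sample count table
n_samples = sum(label_counts.values())   # total training examples

reported_middle = 76.9045                # value listed in the "Median" column

# With an odd number of integer word counts, a true median is itself an integer.
median_must_be_integer = n_samples % 2 == 1

# If the value were a mean, the implied total word count would sit near an integer.
implied_total = reported_middle * n_samples
nearest_total = round(implied_total)

print(n_samples, median_must_be_integer, nearest_total)
```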
|
178 |
+
|
179 |
+
### Training Hyperparameters
|
180 |
+
- batch_size: (16, 16)
|
181 |
+
- num_epochs: (1, 1)
|
182 |
+
- max_steps: -1
|
183 |
+
- sampling_strategy: oversampling
|
184 |
+
- num_iterations: 20
|
185 |
+
- body_learning_rate: (2e-05, 2e-05)
|
186 |
+
- head_learning_rate: 2e-05
|
187 |
+
- loss: CosineSimilarityLoss
|
188 |
+
- distance_metric: cosine_distance
|
189 |
+
- margin: 0.25
|
190 |
+
- end_to_end: False
|
191 |
+
- use_amp: False
|
192 |
+
- warmup_proportion: 0.1
|
193 |
+
- l2_weight: 0.01
|
194 |
+
- seed: 42
|
195 |
+
- eval_max_steps: -1
|
196 |
+
- load_best_model_at_end: False
|
197 |
+
|
198 |
+
### Training Results
|
199 |
+
| Epoch | Step | Training Loss | Validation Loss |
|:------:|:----:|:-------------:|:---------------:|
| 0.0020 | 1 | 0.2375 | - |
| 0.1004 | 50 | 0.2548 | - |
| 0.2008 | 100 | 0.2339 | - |
| 0.3012 | 150 | 0.0973 | - |
| 0.4016 | 200 | 0.0347 | - |
| 0.5020 | 250 | 0.0125 | - |
| 0.6024 | 300 | 0.0058 | - |
| 0.7028 | 350 | 0.0039 | - |
| 0.8032 | 400 | 0.0033 | - |
| 0.9036 | 450 | 0.0023 | - |
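The Epoch column can be reproduced from the hyperparameters and sample counts listed earlier. The sketch below assumes that each training example contributes `num_iterations` positive pairs and `num_iterations` negative pairs, which is how the pair count appears to work out here; treat that formula as an assumption rather than a documented guarantee.

```python
import math

n_samples = 95 + 104        # training examples from the label count table
num_iterations = 20         # from the hyperparameter list
batch_size = 16             # first element of batch_size (embedding phase)

# Assumption: num_iterations positive and num_iterations negative pairs per example.
n_pairs = 2 * num_iterations * n_samples
steps_per_epoch = math.ceil(n_pairs / batch_size)

# Fraction of the epoch completed at a few of the logged steps.
for step in (1, 50, 450):
    print(step, f"{step / steps_per_epoch:.4f}")
```

Under that assumption the logged steps land on the same epoch fractions as the table, and the single epoch ends shortly after the last logged step.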
|
211 |
+
|
212 |
+
### Framework Versions
|
213 |
+
- Python: 3.10.14
|
214 |
+
- SetFit: 1.1.0
|
215 |
+
- Sentence Transformers: 3.1.1
|
216 |
+
- Transformers: 4.44.0
|
217 |
+
- PyTorch: 2.4.0+cu121
|
218 |
+
- Datasets: 3.0.0
|
219 |
+
- Tokenizers: 0.19.1
|
220 |
+
|
221 |
+
## Citation
|
222 |
+
|
223 |
+
### BibTeX
|
224 |
+
```bibtex
|
225 |
+
@article{https://doi.org/10.48550/arxiv.2209.11055,
|
226 |
+
doi = {10.48550/ARXIV.2209.11055},
|
227 |
+
url = {https://arxiv.org/abs/2209.11055},
|
228 |
+
author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
|
229 |
+
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
|
230 |
+
title = {Efficient Few-Shot Learning Without Prompts},
|
231 |
+
publisher = {arXiv},
|
232 |
+
year = {2022},
|
233 |
+
copyright = {Creative Commons Attribution 4.0 International}
|
234 |
+
}
|
235 |
+
```
|
236 |
+
|
237 |
+
<!--
|
238 |
+
## Glossary
|
239 |
+
|
240 |
+
*Clearly define terms in order to be accessible across audiences.*
|
241 |
+
-->
|
242 |
+
|
243 |
+
<!--
|
244 |
+
## Model Card Authors
|
245 |
+
|
246 |
+
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
|
247 |
+
-->
|
248 |
+
|
249 |
+
<!--
|
250 |
+
## Model Card Contact
|
251 |
+
|
252 |
+
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
|
253 |
+
-->
|
config.json
ADDED
@@ -0,0 +1,32 @@
1 |
+
{
|
2 |
+
"_name_or_path": "BAAI/bge-base-en-v1.5",
|
3 |
+
"architectures": [
|
4 |
+
"BertModel"
|
5 |
+
],
|
6 |
+
"attention_probs_dropout_prob": 0.1,
|
7 |
+
"classifier_dropout": null,
|
8 |
+
"gradient_checkpointing": false,
|
9 |
+
"hidden_act": "gelu",
|
10 |
+
"hidden_dropout_prob": 0.1,
|
11 |
+
"hidden_size": 768,
|
12 |
+
"id2label": {
|
13 |
+
"0": "LABEL_0"
|
14 |
+
},
|
15 |
+
"initializer_range": 0.02,
|
16 |
+
"intermediate_size": 3072,
|
17 |
+
"label2id": {
|
18 |
+
"LABEL_0": 0
|
19 |
+
},
|
20 |
+
"layer_norm_eps": 1e-12,
|
21 |
+
"max_position_embeddings": 512,
|
22 |
+
"model_type": "bert",
|
23 |
+
"num_attention_heads": 12,
|
24 |
+
"num_hidden_layers": 12,
|
25 |
+
"pad_token_id": 0,
|
26 |
+
"position_embedding_type": "absolute",
|
27 |
+
"torch_dtype": "float32",
|
28 |
+
"transformers_version": "4.44.0",
|
29 |
+
"type_vocab_size": 2,
|
30 |
+
"use_cache": true,
|
31 |
+
"vocab_size": 30522
|
32 |
+
}
|
config_sentence_transformers.json
ADDED
@@ -0,0 +1,10 @@
1 |
+
{
|
2 |
+
"__version__": {
|
3 |
+
"sentence_transformers": "3.1.1",
|
4 |
+
"transformers": "4.44.0",
|
5 |
+
"pytorch": "2.4.0+cu121"
|
6 |
+
},
|
7 |
+
"prompts": {},
|
8 |
+
"default_prompt_name": null,
|
9 |
+
"similarity_fn_name": null
|
10 |
+
}
|
config_setfit.json
ADDED
@@ -0,0 +1,4 @@
1 |
+
{
|
2 |
+
"normalize_embeddings": false,
|
3 |
+
"labels": null
|
4 |
+
}
|
model.safetensors
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:0b058faaa596edd69b23038325e6860c4a1f1b0f8a03e6f89791a171f09a2c28
|
3 |
+
size 437951328
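The file size in this pointer can be cross-checked against the architecture in config.json. The sketch below counts parameters for a standard BERT encoder with those dimensions, assuming the usual layout (embeddings, 12 encoder layers, and a pooler layer saved with the model) and 4 bytes per float32 value. The leftover bytes are presumably the safetensors header and metadata; that attribution is an assumption.

```python
# Values copied from config.json in this commit.
vocab_size, hidden, intermediate = 30522, 768, 3072
max_positions, type_vocab, num_layers = 512, 2, 12

def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm   # Q, K, V, output projection
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)                       # assumed to be saved with the model

n_params = embeddings + num_layers * (attention + feed_forward) + pooler
tensor_bytes = n_params * 4                           # float32 is 4 bytes per value
file_bytes = 437951328                                # size from the LFS pointer
overhead = file_bytes - tensor_bytes                  # presumably header and metadata

print(n_params, tensor_bytes, overhead)
```

Under these assumptions the weights account for all but a few tens of kilobytes of the file, which suggests the checkpoint holds the full float32 encoder and nothing else of significant size.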
|
model_head.pkl
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:f006db18912008dfd2fa5c3b984b1ae0f0f0050b2d4fc3fb193f90d22f67dfa9
|
3 |
+
size 7007
|
modules.json
ADDED
@@ -0,0 +1,20 @@
1 |
+
[
|
2 |
+
{
|
3 |
+
"idx": 0,
|
4 |
+
"name": "0",
|
5 |
+
"path": "",
|
6 |
+
"type": "sentence_transformers.models.Transformer"
|
7 |
+
},
|
8 |
+
{
|
9 |
+
"idx": 1,
|
10 |
+
"name": "1",
|
11 |
+
"path": "1_Pooling",
|
12 |
+
"type": "sentence_transformers.models.Pooling"
|
13 |
+
},
|
14 |
+
{
|
15 |
+
"idx": 2,
|
16 |
+
"name": "2",
|
17 |
+
"path": "2_Normalize",
|
18 |
+
"type": "sentence_transformers.models.Normalize"
|
19 |
+
}
|
20 |
+
]
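The three modules above run in sequence: the Transformer produces one vector per token, the Pooling module reduces them to a single vector (CLS pooling, per 1_Pooling/config.json), and Normalize rescales the result to unit length. A dependency-free sketch of the last two stages follows; the function names and the toy 4-dimensional vectors are illustrative only, since the real model works with 768-dimensional tensors.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep the first token's vector, which is the [CLS] position."""
    return token_embeddings[0]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Toy stand-in for the Transformer output: 3 tokens, 4 dimensions.
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],   # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.2, 0.1, 0.9],
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

With unit-length embeddings, the dot product of two sentence vectors equals their cosine similarity, which fits the cosine-based loss listed in the training hyperparameters.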
|
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
1 |
+
{
|
2 |
+
"max_seq_length": 512,
|
3 |
+
"do_lower_case": true
|
4 |
+
}
|
special_tokens_map.json
ADDED
@@ -0,0 +1,37 @@
1 |
+
{
|
2 |
+
"cls_token": {
|
3 |
+
"content": "[CLS]",
|
4 |
+
"lstrip": false,
|
5 |
+
"normalized": false,
|
6 |
+
"rstrip": false,
|
7 |
+
"single_word": false
|
8 |
+
},
|
9 |
+
"mask_token": {
|
10 |
+
"content": "[MASK]",
|
11 |
+
"lstrip": false,
|
12 |
+
"normalized": false,
|
13 |
+
"rstrip": false,
|
14 |
+
"single_word": false
|
15 |
+
},
|
16 |
+
"pad_token": {
|
17 |
+
"content": "[PAD]",
|
18 |
+
"lstrip": false,
|
19 |
+
"normalized": false,
|
20 |
+
"rstrip": false,
|
21 |
+
"single_word": false
|
22 |
+
},
|
23 |
+
"sep_token": {
|
24 |
+
"content": "[SEP]",
|
25 |
+
"lstrip": false,
|
26 |
+
"normalized": false,
|
27 |
+
"rstrip": false,
|
28 |
+
"single_word": false
|
29 |
+
},
|
30 |
+
"unk_token": {
|
31 |
+
"content": "[UNK]",
|
32 |
+
"lstrip": false,
|
33 |
+
"normalized": false,
|
34 |
+
"rstrip": false,
|
35 |
+
"single_word": false
|
36 |
+
}
|
37 |
+
}
|
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,57 @@
1 |
+
{
|
2 |
+
"added_tokens_decoder": {
|
3 |
+
"0": {
|
4 |
+
"content": "[PAD]",
|
5 |
+
"lstrip": false,
|
6 |
+
"normalized": false,
|
7 |
+
"rstrip": false,
|
8 |
+
"single_word": false,
|
9 |
+
"special": true
|
10 |
+
},
|
11 |
+
"100": {
|
12 |
+
"content": "[UNK]",
|
13 |
+
"lstrip": false,
|
14 |
+
"normalized": false,
|
15 |
+
"rstrip": false,
|
16 |
+
"single_word": false,
|
17 |
+
"special": true
|
18 |
+
},
|
19 |
+
"101": {
|
20 |
+
"content": "[CLS]",
|
21 |
+
"lstrip": false,
|
22 |
+
"normalized": false,
|
23 |
+
"rstrip": false,
|
24 |
+
"single_word": false,
|
25 |
+
"special": true
|
26 |
+
},
|
27 |
+
"102": {
|
28 |
+
"content": "[SEP]",
|
29 |
+
"lstrip": false,
|
30 |
+
"normalized": false,
|
31 |
+
"rstrip": false,
|
32 |
+
"single_word": false,
|
33 |
+
"special": true
|
34 |
+
},
|
35 |
+
"103": {
|
36 |
+
"content": "[MASK]",
|
37 |
+
"lstrip": false,
|
38 |
+
"normalized": false,
|
39 |
+
"rstrip": false,
|
40 |
+
"single_word": false,
|
41 |
+
"special": true
|
42 |
+
}
|
43 |
+
},
|
44 |
+
"clean_up_tokenization_spaces": true,
|
45 |
+
"cls_token": "[CLS]",
|
46 |
+
"do_basic_tokenize": true,
|
47 |
+
"do_lower_case": true,
|
48 |
+
"mask_token": "[MASK]",
|
49 |
+
"model_max_length": 512,
|
50 |
+
"never_split": null,
|
51 |
+
"pad_token": "[PAD]",
|
52 |
+
"sep_token": "[SEP]",
|
53 |
+
"strip_accents": null,
|
54 |
+
"tokenize_chinese_chars": true,
|
55 |
+
"tokenizer_class": "BertTokenizer",
|
56 |
+
"unk_token": "[UNK]"
|
57 |
+
}
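To make the role of these settings concrete, here is a small sketch of how a BERT-style tokenizer typically frames a sequence: token ids are wrapped as `[CLS] ... [SEP]`, truncated so the total fits `model_max_length`, and optionally padded with `[PAD]`. The special-token ids come from `added_tokens_decoder` above; the helper `build_inputs` and the placeholder token ids are hypothetical, and the real `BertTokenizer` does considerably more (WordPiece splitting, lowercasing, and so on).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # ids from added_tokens_decoder
MODEL_MAX_LENGTH = 512                 # model_max_length

def build_inputs(token_ids, max_length=MODEL_MAX_LENGTH, pad_to=None):
    """Wrap token ids as [CLS] ... [SEP], truncating on the right to fit max_length."""
    body = token_ids[: max_length - 2]          # reserve two slots for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding         # padded positions are masked out
    return input_ids, attention_mask

# Placeholder ids standing in for two WordPiece tokens.
ids, mask = build_inputs([7592, 2088], pad_to=6)
print(ids, mask)
```

Because two positions are reserved for the special tokens, at most 510 content tokens of a long input would survive truncation in this scheme.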
|
vocab.txt
ADDED
The diff for this file is too large to render.