Pclanglais committed (verified) · Commit 381da53 · 1 Parent(s): 95fae96

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,119 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - PleIAs/common_corpus
5
+ language:
6
+ - en
7
+ - fr
8
+ - de
9
+ - es
10
+ - it
11
+ - nl
12
+ - la
13
+ - pt
14
+ ---
15
+
16
+ <div style="text-align: center;">
17
+ <img src="https://raw.githubusercontent.com/Pleias/logos/d6152d7943905da32a1e04fdfd7708ed9c7eed5e/PleIAs%201_0%20Full%20Logo%20(Black).png" style="width: 80%; margin: 0 auto; display: inline-block;"/>
18
+ </div>
19
+
20
+ **Pleias-nano-1.2b-RAG 0.1** is a specialized language model designed by [Pleias](https://huggingface.co/PleIAs) and trained with [Tracto AI](https://tracto.ai/) for Retrieval-Augmented Generation.
21
+
22
+ Like its base model, [Pleias-nano-1.2b-Preview](https://huggingface.co/PleIAs/Pleias-nano-1.2b-Preview), Pleias-nano-1.2b-RAG 0.1 aims to be a fully open model (weights, code, data), trained only on content under a permissive license and fully compliant with the upcoming European AI Act.
23
+
24
+ ## Description
25
+ Pleias-nano-1.2b-RAG was obtained by continuous pretraining of Pleias-nano-1.2b-Preview on a new dataset of 45,088,768,000 tokens modeling common retrieval tasks. All the content of the dataset ultimately comes from Common Corpus.
26
+
27
+ Pleias-nano-1.2b-RAG includes the main features of the original base model:
28
+ * Only trained on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
29
+ * Extensive multilingual support for main European languages: English, French, German, Spanish, Italian, Dutch, Latin, Portuguese and Polish.
30
+ * Extremely low level of toxicity and problematic content.
31
+
32
+ Pleias-nano-1.2b-RAG supports retrieval-augmented generation with enhanced verifiability, source analysis and grounding on submitted sources. This includes:
33
+ * A standardized structure and special tokens to delimit queries, sources, source analyses and answers.
34
+ * Anticipation of various query forms in multiple languages, from fully drafted questions to unstructured lists of keywords.
35
+ * Source analysis/criticism, which also acts as an integrated reranking step.
36
+ * Generation of grounded answers with references and excerpts linked to the original sources.
37
+
38
+ While the base model Pleias-nano-1.2b-Preview has been made available as an experimental preview, we release Pleias-nano-1.2b-RAG 0.1 as an early version. Pleias-nano-1.2b-RAG 0.1 has already been tested and integrated into multiple applied RAG projects, including Pleias' flagship application Scholasticai.
39
+
40
+ ## Training
41
+ Pleias-nano-1.2b-RAG was pretrained on TractoAI, on the ISEG GPU cluster operated by Nebius AI, using the fork of Nanotron developed by TractoAI. We provide the complete settings as a YAML file as part of our release.
42
+
43
+ Pleias-nano-1.2b-RAG derives from the last checkpoint of Pleias-nano-1.2b-Preview (step 369,000). The training schedule reused the last learning rate value (5e-6) without decay for 43,000 steps. Each step is about ten times smaller than the original steps from the base model training (roughly 1M tokens per step vs. 12M tokens).
44
+
45
+ Training covers the entire RAG dataset we designed based on [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), for 3 epochs.
46
+
47
+ Further experiments were made with different learning rate values: none of these tests provided better convergence than the one obtained with the final learning rate from the base model.
48
+
49
+ ## Inference
50
+ Pleias-nano-1.2b-RAG relies on special tokens to encode the core RAG functionalities:
51
+
52
+ A typical example, with excerpts drawn from the Wikipedia article about Wikipedia:
53
+ ```text
54
+ <|query_start|>Is Wikipedia reliable?<|query_end|>
55
+ <|source_start|><|source_id_start|>ebea70a3502acfbd<|source_id_end|>Articles for traditional encyclopedias such as Encyclopædia Britannica are written by experts, lending such encyclopedias a reputation for accuracy.[144] However, a peer review in 2005 of forty-two scientific entries on both Wikipedia and Encyclopædia Britannica by the science journal Nature found few differences in accuracy, and concluded that "the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three."[145] Joseph Reagle suggested that while the study reflects "a topical strength of Wikipedia contributors" in science articles, "Wikipedia may not have fared so well using a random sampling of articles or on humanities subjects."<|source_end|>
56
+ <|source_start|><|source_id_start|>5f862e733d38288e<|source_id_end|>As a consequence of the open structure, Wikipedia "makes no guarantee of validity" of its content, since no one is ultimately responsible for any claims appearing in it.[W 54] Concerns have been raised by PC World in 2009 regarding the lack of accountability that results from users' anonymity, the insertion of false information,[152] vandalism, and similar problems. Legal Research in a Nutshell (2011), cites Wikipedia as a "general source" that "can be a real boon" in "coming up to speed in the law governing a situation" and, "while not authoritative, can provide basic facts as well as leads to more in-depth resources".<|source_end|>
57
+ <|source_start|><|source_id_start|>354fa4908152b336<|source_id_end|>Wikipedia's open structure inherently makes it an easy target for Internet trolls, spammers, and various forms of paid advocacy seen as counterproductive to the maintenance of a neutral and verifiable online encyclopedia.[70][W 55] In response to paid advocacy editing and undisclosed editing issues, Wikipedia was reported in an article in The Wall Street Journal to have strengthened its rules and laws against undisclosed editing.[162] The article stated that: "Beginning Monday [from the date of the article, June 16, 2014], changes in Wikipedia's terms of use will require anyone paid to edit articles to disclose that arrangement. Katherine Maher, the nonprofit Wikimedia Foundation's chief communications officer, said the changes address a sentiment among volunteer editors that 'we're not an advertising service; we're an encyclopedia.'"<|source_end|>
58
+ <|source_analysis_start|>
59
+ ```
60
+
61
+ As a specialized language model, Pleias-nano-1.2b-RAG will not work properly with prompts that depart from this design (see the usage sketch below).
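+
+ A minimal usage sketch with the `transformers` library: the repository id and the `build_prompt` helper below are illustrative placeholders, but the prompt layout mirrors the example above and generation is deterministic, as recommended under "Acceptable use".
+
+ ```python
+ # Sketch: assemble a RAG prompt with the special tokens above and run
+ # greedy (deterministic) decoding. The repository id is a placeholder.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "PleIAs/Pleias-nano-1.2b-RAG"  # placeholder repository id
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ def build_prompt(query: str, sources: list[tuple[str, str]]) -> str:
+     """Wrap the query and (hash, text) sources with the RAG special tokens."""
+     parts = [f"<|query_start|>{query}<|query_end|>\n"]
+     for source_id, text in sources:
+         parts.append(
+             f"<|source_start|><|source_id_start|>{source_id}<|source_id_end|>{text}<|source_end|>\n"
+         )
+     parts.append("<|source_analysis_start|>")
+     return "".join(parts)
+
+ prompt = build_prompt(
+     "Is Wikipedia reliable?",
+     [("ebea70a3502acfbd", "Articles for traditional encyclopedias such as ...")],
+ )
+ inputs = tokenizer(prompt, return_tensors="pt")
+ # Deterministic generation, no repetition penalty.
+ outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
+ ```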
62
+
63
+ ### RAG Evaluation
64
+
65
+ We evaluate Pico and Nano models on a RAG task. As existing benchmarks are largely limited to English, we develop a custom multilingual RAG benchmark. We synthetically generate queries and small sets of documents. To evaluate, we prompted models with the query and documents. We then ran a head-to-head ELO-based tournament with GPT-4o as judge. We [release the prompts and generations for all models we compared](https://huggingface.co/datasets/PleIAs/Pleias-1.0-eval/tree/main/RAGarena). Our nano (1.2B) model outperforms Llama 3.2 1.1B and EuroLLM 1.7B. Our pico (350M) model outperforms other models in its weight class, such as SmolLM 360M and Qwen2.5 500M, in addition to much larger models, such as Llama 3.2 1.1B and EuroLLM 1.7B.
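+
+ For reference, one judged head-to-head match updates ratings with the standard Elo formula sketched below; the K-factor of 32 is an illustrative assumption, not the tournament's exact setting.
+
+ ```python
+ # Standard Elo update for a single judged match (K = 32 is illustrative).
+ def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
+     """Return updated (rating_a, rating_b); score_a is 1.0 win, 0.5 draw, 0.0 loss."""
+     expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
+     new_a = rating_a + k * (score_a - expected_a)
+     new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
+     return new_a, new_b
+
+ # Example: the judge prefers model A's grounded answer.
+ print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
+ ```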
66
+
67
+ | **Rank** | **Model** | **ELO** |
68
+ |----------|--------------------------|------------|
69
+ | 1 | Qwen2.5-Instruct-7B | 1294.6 |
70
+ | 2 | Llama-3.2-Instruct-8B | 1269.8 |
71
+ | 3 | **Pleias-nano-1.2B-RAG** | **1137.5** |
72
+ | 4 | Llama-3.2-Instruct-3B | 1118.1 |
73
+ | 5 | Qwen2.5-Instruct-3B | 1078.1 |
74
+ | 6 | **Pleias-pico-350M-RAG** | **1051.2** |
75
+ | 7 | Llama-3.2-1B-Instruct | 872.3 |
76
+ | 8 | EuroLLM-1.7B-Instruct | 860.0 |
77
+ | 9 | SmolLM-360M-Instruct | 728.6 |
78
+ | 10 | Qwen2.5-0.5B-Instruct | 722.2 |
79
+ | 11 | SmolLM-1.7B-Instruct | 706.3 |
80
+
81
+ ## Acceptable use
82
+ Pleias-nano-1.2b-RAG includes a much wider range of support for verifiability and grounding than most generalist models.
83
+
84
+ The model is not a substitute for an integrated RAG application. Retrieval errors as well as challenging texts and questions can still create a range of issues. We especially encourage end users to take advantage of the citations and references, which provide better indicators of accuracy.
85
+
86
+ For best results we recommend the following settings:
87
+ * Deterministic generation (temperature = 0) and no repetition penalty (repetition penalties are unsurprisingly detrimental to the accuracy of citations).
88
+ * Standardized hashes of 16 characters. While the model has been trained on many other patterns (including full bibliographic entries), this has proven the most convenient for systematic citation parsing (see the parsing sketch below).
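+
+ A minimal parsing sketch for mapping cited hashes back to submitted sources; the regex and helper name are illustrative, assuming references reproduce the 16-character hexadecimal identifiers shown in the prompt example above.
+
+ ```python
+ import re
+
+ # Pull 16-character source hashes (e.g. "ebea70a3502acfbd") out of a generated answer.
+ SOURCE_ID = re.compile(r"\b[0-9a-f]{16}\b")
+
+ def extract_source_ids(generated_text: str) -> list[str]:
+     """Return the distinct 16-character source identifiers cited in the text."""
+     seen: list[str] = []
+     for match in SOURCE_ID.findall(generated_text):
+         if match not in seen:
+             seen.append(match)
+     return seen
+
+ print(extract_source_ids("As noted in ebea70a3502acfbd, Wikipedia ..."))
+ ```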
89
+
90
+ Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 4 tons CO2eq for training.
91
+
92
+ ## Ethical Considerations
93
+
94
+ The Pleias-nano-1.2b-RAG model, like all large language models, carries inherent ethical risks that require careful consideration. Our approach to mitigating these risks begins at the data level, where we exclusively use vetted sources, deliberately excluding CommonCrawl. The primary challenge comes from our public domain dataset component, which contains historical texts that may reflect outdated social norms and potentially harmful language, particularly regarding minoritized groups.
95
+
96
+ To address this, we implemented a systematic ethical filtering process using toxicity classifiers to identify extremely harmful content. We also employed synthetic rewriting techniques to transform mildly problematic passages while preserving the underlying informational value. This process significantly reduced potential societal harm without compromising the dataset's size or textual quality, resulting in notably low toxicity scores in benchmarks compared to other models.
97
+
98
+ Despite these preventive measures, users should be aware that the model has not undergone additional safety alignment procedures and may still produce problematic outputs. The model's capabilities in generative AI tasks must be balanced against the risks of bias, misinformation propagation, and autonomous decision-making challenges. We explicitly prohibit any malicious utilization and emphasize the responsibility of users to implement appropriate safeguards.
99
+
100
+ At Pleias, we continue to research and develop improved methods for creating safer and more equitable models and datasets. This includes ongoing work in toxicity reduction, bias mitigation, and the development of more sophisticated ethical filtering techniques.
101
+
102
+ ## Acknowledgements
103
+
104
+ This work would not have been possible without the substantial support and technical expertise from TractoAI, a serverless AI platform for running data and compute-intensive workloads at scale.
105
+
106
+ We are deeply grateful to the Mozilla Foundation Local AI Program for their generous support.
107
+
108
+ Finally, we acknowledge the significant contributions from the open science LLM community, particularly HuggingFace, Eleuther AI and Allen AI whose insights and cooperation have been invaluable to our work.
109
+
110
+ ## Future updates
111
+ Pleias-nano-1.2b-RAG will be continuously improved through iterative retraining/adaptation.
112
+
113
+ The current roadmap includes the following features:
114
+ * Context length expansion.
115
+ * Better handling of multilingual sources. In its current form, Pleias-nano-1.2b-RAG will generally switch languages if a query is made against sources in a different language.
116
+ * New sampling methods inspired by Entropix, for better combined support of creativity and accuracy.
117
+ * Interactive/conversational RAG.
118
+
119
+ End users are encouraged to update to the latest version whenever possible.
config.json CHANGED
@@ -6,6 +6,7 @@
6
  "attention_dropout": 0.0,
7
  "bos_token_id": 1,
8
  "eos_token_id": 2,
 
9
  "hidden_act": "silu",
10
  "hidden_size": 2048,
11
  "initializer_range": 0.02,
@@ -22,7 +23,7 @@
22
  "rope_theta": 10000,
23
  "tie_word_embeddings": true,
24
  "torch_dtype": "bfloat16",
25
- "transformers_version": "4.44.2",
26
  "use_cache": true,
27
  "vocab_size": 65536
28
  }
 
6
  "attention_dropout": 0.0,
7
  "bos_token_id": 1,
8
  "eos_token_id": 2,
9
+ "head_dim": 64,
10
  "hidden_act": "silu",
11
  "hidden_size": 2048,
12
  "initializer_range": 0.02,
 
23
  "rope_theta": 10000,
24
  "tie_word_embeddings": true,
25
  "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.46.0",
27
  "use_cache": true,
28
  "vocab_size": 65536
29
  }
generation_config.json CHANGED
@@ -2,5 +2,5 @@
2
  "_from_model_config": true,
3
  "bos_token_id": 1,
4
  "eos_token_id": 2,
5
- "transformers_version": "4.44.2"
6
  }
 
2
  "_from_model_config": true,
3
  "bos_token_id": 1,
4
  "eos_token_id": 2,
5
+ "transformers_version": "4.46.0"
6
  }
pleias_1b_base/.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
pleias_1b_base/README.md ADDED
@@ -0,0 +1,77 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - PleIAs/common_corpus
5
+ language:
6
+ - en
7
+ - fr
8
+ - es
9
+ - de
10
+ - it
11
+ - la
12
+ - nl
13
+ - pl
14
+ ---
15
+ <div style="text-align: center;">
16
+ <img src="https://raw.githubusercontent.com/Pleias/logos/d6152d7943905da32a1e04fdfd7708ed9c7eed5e/PleIAs%201_0%20Full%20Logo%20(Black).png" style="width: 80%; margin: 0 auto; display: inline-block;"/>
17
+ </div>
18
+
19
+ **Pleias-nano-1.2b-Preview** is an early preview of a 1.21 billion parameters base model trained by [Pleias](https://huggingface.co/PleIAs) with [Tracto AI](https://tracto.ai/) on [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus).
20
+
21
+ Like all the base and specialized models from Pleias, Pleias-nano-1.2b-Preview has only been trained on open data out of copyright (public domain) or under a permissive license.
22
+
23
+ ## Description
24
+ Pleias-nano-1.2b-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-Neox for easier deployment/inference.
25
+
26
+ It includes the following features, which would apply to any responsibly trained variant:
27
+ * Only trained on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
28
+ * Extensive multilingual support for main European languages.
29
+ * A new tokenizer designed for enhanced document processing tasks and better multilingual support.
30
+ * Extremely low level of toxicity and problematic content.
31
+
32
+ Pleias-nano-1.2b-Preview has demonstrated unusual abilities for multilingual generation in its size range. Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.
33
+
34
+ Given its size, Pleias-nano-1.2b-Preview can run on CPU without any lossy compression. We provide a first GGUF variant as part of our release.
35
+
36
+ ## Recommended use
37
+ As a base model, Pleias-nano-1.2b-Preview is only able to run continuation prompts.
38
+
39
+ Text generation is currently able to support a range of creative writing tasks in multiple European languages. For more consistent results we recommend using a low or zero temperature with a slight repetition penalty (1.2), as in the sketch below.
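+
+ A minimal continuation sketch with the `transformers` pipeline, using the settings above; the prompt text is illustrative.
+
+ ```python
+ # Sketch: continuation prompting with near-deterministic decoding and a
+ # slight repetition penalty (1.2), as recommended above.
+ from transformers import pipeline
+
+ generator = pipeline("text-generation", model="PleIAs/Pleias-nano-1.2b-Preview")
+ out = generator(
+     "The city of Toulouse was founded",  # free-form continuation prompt
+     max_new_tokens=128,
+     do_sample=False,            # greedy decoding (zero temperature)
+     repetition_penalty=1.2,
+ )
+ print(out[0]["generated_text"])
+ ```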
40
+
41
+ Pleias-nano-1.2b-Preview has been successfully adapted for continuous pretraining and full fine-tuning on document processing tasks such as RAG, translation or OCR correction. Given the small size of the model we do not recommend fine-tuning methods based on LoRA.
42
+
43
+ ## Example
44
+
45
+
46
+ ## Training
47
+ Pleias-nano-1.2b-Preview was fully pretrained on TractoAI, on the ISEG GPU cluster operated by Nebius AI, using 192 H100s for 5 days. Pretraining code relied on [the fork of Nanotron developed by TractoAI](https://github.com/tractoai/nanotron). We provide the complete settings as a YAML file as part of our release.
48
+
49
+ The training schedule included 518,000 steps (batch size 1,024) over three epochs (nearly 5 trillion tokens):
50
+ * A lightly filtered version of Common Corpus (1.6 trillion tokens)
51
+ * A filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
52
+ * A repeat of the previous set.
53
+
54
+ Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 4 tons CO2eq for training.
55
+
56
+ ## Ethical Considerations
57
+
58
+ The Pleias-nano-1.2b-Preview model, like all large language models, carries inherent ethical risks that require careful consideration. Our approach to mitigating these risks begins at the data level, where we exclusively use vetted sources, deliberately excluding CommonCrawl. The primary challenge comes from our public domain dataset component, which contains historical texts that may reflect outdated social norms and potentially harmful language, particularly regarding minoritized groups.
59
+
60
+ To address this, we implemented a systematic ethical filtering process using toxicity classifiers to identify extremely harmful content. We also employed synthetic rewriting techniques to transform mildly problematic passages while preserving the underlying informational value. This process significantly reduced potential societal harm without compromising the dataset's size or textual quality, resulting in notably low toxicity scores in benchmarks compared to other models.
61
+
62
+ Despite these preventive measures, users should be aware that the model has not undergone additional safety alignment procedures and may still produce problematic outputs. The model's capabilities in generative AI tasks must be balanced against the risks of bias, misinformation propagation, and autonomous decision-making challenges. We explicitly prohibit any malicious utilization and emphasize the responsibility of users to implement appropriate safeguards.
63
+
64
+ At Pleias, we continue to research and develop improved methods for creating safer and more equitable models and datasets. This includes ongoing work in toxicity reduction, bias mitigation, and the development of more sophisticated ethical filtering techniques.
65
+
66
+ ## Acknowledgements
67
+
68
+ This work would not have been possible without the substantial support and technical expertise from TractoAI, a serverless AI platform for running data and compute-intensive workloads at scale.
69
+
70
+ We are deeply grateful to the Mozilla Foundation Local AI Program for their generous support.
71
+
72
+ Finally, we acknowledge the significant contributions from the open science LLM community, particularly HuggingFace, Eleuther AI and Allen AI whose insights and cooperation have been invaluable to our work.
73
+
74
+ ## Update
75
+ Pleias-nano-1.2b-Preview is currently released as an early preview.
76
+
77
+ The model will undergo several more rounds of post-training to enhance reasoning capacities and fine-tunability, in anticipation of a generalist instruct version.
pleias_1b_base/config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 1,
8
+ "eos_token_id": 2,
9
+ "head_dim": 64,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 2048,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 6144,
14
+ "max_position_embeddings": 2048,
15
+ "mlp_bias": false,
16
+ "model_type": "llama",
17
+ "num_attention_heads": 32,
18
+ "num_hidden_layers": 22,
19
+ "num_key_value_heads": 8,
20
+ "pretraining_tp": 1,
21
+ "rms_norm_eps": 1e-05,
22
+ "rope_scaling": null,
23
+ "rope_theta": 10000,
24
+ "tie_word_embeddings": true,
25
+ "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.46.0",
27
+ "use_cache": true,
28
+ "vocab_size": 65536
29
+ }
pleias_1b_base/generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.46.0"
6
+ }
pleias_1b_base/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20211134a3a0ffb8efbb3c50443dc57f08390d7a4b1957f5d95896549333c2c4
3
+ size 2390960616
pleias_1b_base/special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {}
pleias_1b_base/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
pleias_1b_base/tokenizer_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[UNK]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<|begin_of_text|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "<|end_of_text|>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[PAD]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ }
35
+ },
36
+ "clean_up_tokenization_spaces": true,
37
+ "model_max_length": 1000000000000000019884624838656,
38
+ "tokenizer_class": "PreTrainedTokenizerFast"
39
+ }
special_tokens_map.json CHANGED
@@ -1,15 +1,30 @@
1
  {
2
  "additional_special_tokens": [
3
- "<|source_id|>",
 
 
 
4
  "<|source_analysis_start|>",
5
  "<|source_analysis_end|>",
6
  "<|source_start|>",
7
  "<|source_end|>",
 
 
8
  "<|answer_start|>",
9
  "<|answer_end|>",
10
- "<|query_start|>",
11
- "<|query_end|>",
12
  "<|source_interpretation_start|>",
13
- "<|source_interpretation_end|>"
 
 
14
  ]
15
- }
 
1
  {
2
  "additional_special_tokens": [
3
+ "<|tool_call_start|>",
4
+ "<|tool_call_end|>",
5
+ "<|tool_list_start|>",
6
+ "<|tool_list_end|>",
7
  "<|source_analysis_start|>",
8
  "<|source_analysis_end|>",
9
  "<|source_start|>",
10
  "<|source_end|>",
11
+ "<|im_start|>",
12
+ "<|im_end|>",
13
  "<|answer_start|>",
14
  "<|answer_end|>",
15
+ "<|text_start|>",
16
+ "<|text_end|>",
17
+ "<|translation_start|>",
18
+ "<|translation_end|>",
19
+ "<|back_translation_start|>",
20
+ "<|back_translation_end|>",
21
+ "<|ocr_correction_start|>",
22
+ "<|ocr_correction_end|>",
23
+ "<|json_scheme_start|>",
24
+ "<|json_scheme_end|>",
25
  "<|source_interpretation_start|>",
26
+ "<|source_interpretation_end|>",
27
+ "<|query_start|>",
28
+ "<|query_end|>"
29
  ]
30
+ }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -32,8 +32,88 @@
32
  "single_word": false,
33
  "special": true
34
  },
35
  "65520": {
36
- "content": "<|source_id|>",
37
  "lstrip": false,
38
  "normalized": false,
39
  "rstrip": false,
@@ -41,7 +121,7 @@
41
  "special": true
42
  },
43
  "65521": {
44
- "content": "<|source_analysis_start|>",
45
  "lstrip": false,
46
  "normalized": false,
47
  "rstrip": false,
@@ -49,7 +129,7 @@
49
  "special": true
50
  },
51
  "65522": {
52
- "content": "<|source_analysis_end|>",
53
  "lstrip": false,
54
  "normalized": false,
55
  "rstrip": false,
@@ -57,7 +137,7 @@
57
  "special": true
58
  },
59
  "65523": {
60
- "content": "<|source_start|>",
61
  "lstrip": false,
62
  "normalized": false,
63
  "rstrip": false,
@@ -65,7 +145,7 @@
65
  "special": true
66
  },
67
  "65524": {
68
- "content": "<|source_end|>",
69
  "lstrip": false,
70
  "normalized": false,
71
  "rstrip": false,
@@ -73,7 +153,7 @@
73
  "special": true
74
  },
75
  "65525": {
76
- "content": "<|answer_start|>",
77
  "lstrip": false,
78
  "normalized": false,
79
  "rstrip": false,
@@ -81,7 +161,7 @@
81
  "special": true
82
  },
83
  "65526": {
84
- "content": "<|answer_end|>",
85
  "lstrip": false,
86
  "normalized": false,
87
  "rstrip": false,
@@ -89,7 +169,7 @@
89
  "special": true
90
  },
91
  "65527": {
92
- "content": "<|query_start|>",
93
  "lstrip": false,
94
  "normalized": false,
95
  "rstrip": false,
@@ -97,7 +177,7 @@
97
  "special": true
98
  },
99
  "65528": {
100
- "content": "<|query_end|>",
101
  "lstrip": false,
102
  "normalized": false,
103
  "rstrip": false,
@@ -105,7 +185,7 @@
105
  "special": true
106
  },
107
  "65529": {
108
- "content": "<|source_interpretation_start|>",
109
  "lstrip": false,
110
  "normalized": false,
111
  "rstrip": false,
@@ -113,28 +193,83 @@
113
  "special": true
114
  },
115
  "65530": {
116
  "content": "<|source_interpretation_end|>",
117
  "lstrip": false,
118
  "normalized": false,
119
  "rstrip": false,
120
  "single_word": false,
121
  "special": true
122
  }
123
  },
124
  "additional_special_tokens": [
125
- "<|source_id|>",
 
 
 
126
  "<|source_analysis_start|>",
127
  "<|source_analysis_end|>",
128
  "<|source_start|>",
129
  "<|source_end|>",
 
 
130
  "<|answer_start|>",
131
  "<|answer_end|>",
132
- "<|query_start|>",
133
- "<|query_end|>",
134
  "<|source_interpretation_start|>",
135
- "<|source_interpretation_end|>"
 
 
136
  ],
137
  "clean_up_tokenization_spaces": true,
138
  "model_max_length": 1000000000000000019884624838656,
139
  "tokenizer_class": "PreTrainedTokenizerFast"
140
- }
 
32
  "single_word": false,
33
  "special": true
34
  },
35
+ "65510": {
36
+ "content": "<|tool_call_start|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "65511": {
44
+ "content": "<|tool_call_end|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "65512": {
52
+ "content": "<|tool_list_start|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "65513": {
60
+ "content": "<|tool_list_end|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "65514": {
68
+ "content": "<|source_analysis_start|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "65515": {
76
+ "content": "<|source_analysis_end|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "65516": {
84
+ "content": "<|source_start|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "65517": {
92
+ "content": "<|source_end|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "65518": {
100
+ "content": "<|im_start|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "65519": {
108
+ "content": "<|im_end|>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
  "65520": {
116
+ "content": "<|answer_start|>",
117
  "lstrip": false,
118
  "normalized": false,
119
  "rstrip": false,
 
121
  "special": true
122
  },
123
  "65521": {
124
+ "content": "<|answer_end|>",
125
  "lstrip": false,
126
  "normalized": false,
127
  "rstrip": false,
 
129
  "special": true
130
  },
131
  "65522": {
132
+ "content": "<|text_start|>",
133
  "lstrip": false,
134
  "normalized": false,
135
  "rstrip": false,
 
137
  "special": true
138
  },
139
  "65523": {
140
+ "content": "<|text_end|>",
141
  "lstrip": false,
142
  "normalized": false,
143
  "rstrip": false,
 
145
  "special": true
146
  },
147
  "65524": {
148
+ "content": "<|translation_start|>",
149
  "lstrip": false,
150
  "normalized": false,
151
  "rstrip": false,
 
153
  "special": true
154
  },
155
  "65525": {
156
+ "content": "<|translation_end|>",
157
  "lstrip": false,
158
  "normalized": false,
159
  "rstrip": false,
 
161
  "special": true
162
  },
163
  "65526": {
164
+ "content": "<|back_translation_start|>",
165
  "lstrip": false,
166
  "normalized": false,
167
  "rstrip": false,
 
169
  "special": true
170
  },
171
  "65527": {
172
+ "content": "<|back_translation_end|>",
173
  "lstrip": false,
174
  "normalized": false,
175
  "rstrip": false,
 
177
  "special": true
178
  },
179
  "65528": {
180
+ "content": "<|ocr_correction_start|>",
181
  "lstrip": false,
182
  "normalized": false,
183
  "rstrip": false,
 
185
  "special": true
186
  },
187
  "65529": {
188
+ "content": "<|ocr_correction_end|>",
189
  "lstrip": false,
190
  "normalized": false,
191
  "rstrip": false,
 
193
  "special": true
194
  },
195
  "65530": {
196
+ "content": "<|json_scheme_start|>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "65531": {
204
+ "content": "<|json_scheme_end|>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "65532": {
212
+ "content": "<|source_interpretation_start|>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "65533": {
220
  "content": "<|source_interpretation_end|>",
221
  "lstrip": false,
222
  "normalized": false,
223
  "rstrip": false,
224
  "single_word": false,
225
  "special": true
226
+ },
227
+ "65534": {
228
+ "content": "<|query_start|>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "65535": {
236
+ "content": "<|query_end|>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
  }
243
  },
244
  "additional_special_tokens": [
245
+ "<|tool_call_start|>",
246
+ "<|tool_call_end|>",
247
+ "<|tool_list_start|>",
248
+ "<|tool_list_end|>",
249
  "<|source_analysis_start|>",
250
  "<|source_analysis_end|>",
251
  "<|source_start|>",
252
  "<|source_end|>",
253
+ "<|im_start|>",
254
+ "<|im_end|>",
255
  "<|answer_start|>",
256
  "<|answer_end|>",
257
+ "<|text_start|>",
258
+ "<|text_end|>",
259
+ "<|translation_start|>",
260
+ "<|translation_end|>",
261
+ "<|back_translation_start|>",
262
+ "<|back_translation_end|>",
263
+ "<|ocr_correction_start|>",
264
+ "<|ocr_correction_end|>",
265
+ "<|json_scheme_start|>",
266
+ "<|json_scheme_end|>",
267
  "<|source_interpretation_start|>",
268
+ "<|source_interpretation_end|>",
269
+ "<|query_start|>",
270
+ "<|query_end|>"
271
  ],
272
  "clean_up_tokenization_spaces": true,
273
  "model_max_length": 1000000000000000019884624838656,
274
  "tokenizer_class": "PreTrainedTokenizerFast"
275
+ }