chrisociepa committed
Commit 4ef1d49
1 Parent(s): e9fca89

Update README.md

Files changed (1)
  1. README.md +80 -11

README.md CHANGED
 
@@ -10,15 +10,19 @@ inference:
  temperature: 0.7
  ---
 
+ <p align="center">
+ <img src="https://huggingface.co/speakleash/Bielik-7B-v0.1/raw/main/speakleash_cyfronet.png">
+ </p>
+
  # Bielik-7B-v0.1
 
- The Bielik-7B-v0.1 is a generative text model featuring 7 billion parameters, meticulously evolved from its predecessor, the [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), through the processing of over 70 billion tokens. This model stands as a testament to the unique collaboration between the open-science project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH. Developed and trained on Polish text corpora, meticulously collected and processed by the SpeakLeash team, this endeavor leverages Poland's large-scale computing infrastructure, specifically within the PLGrid environment, and more precisely, the HPC centers: ACK Cyfronet AGH. The training of the Bielik-7B-v0.1 was propelled by the support of computational grant number PLG/2024/016951, conducted on the Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision.
+ The Bielik-7B-v0.1 is a generative text model featuring 7 billion parameters, meticulously evolved from its predecessor, the [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), through the processing of over 70 billion tokens. This model stands as a testament to the unique collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. Developed and trained on Polish text corpora, which have been carefully selected and processed by the SpeakLeash team, this endeavor leverages Poland's large-scale computing infrastructure, specifically within the PLGrid environment and, more precisely, the HPC center ACK Cyfronet AGH. The creation and training of the Bielik-7B-v0.1 was propelled by the support of computational grant number PLG/2024/016951, conducted on the Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision.
 
  ## Model
 
- Bielik-7B-v0.1 has been trained with the use of an original open source framework called [ALLaMo](https://github.com/chrisociepa/allamo) implemented by [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/). This framework allows users to train language models with architecture similar to LLaMA and Mistral in a fast and efficient way.
+ Bielik-7B-v0.1 has been trained with the use of an original open-source framework called [ALLaMo](https://github.com/chrisociepa/allamo) implemented by [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/). This framework allows users to train language models with an architecture similar to LLaMA and Mistral in a fast and efficient way.
 
- The model training was conducted on the Helios Supercomputer at the ACK Cyfronet AGH, utilizing 256 GH200 cards and achieving a throughput exceeding 9200 tokens/gpu/second.
+ The model training was conducted on the Helios Supercomputer at the ACK Cyfronet AGH, utilizing 256 NVIDIA GH200 cards and achieving a throughput exceeding 9200 tokens/GPU/second.
 
  The training dataset was composed of Polish texts collected and made available through the [SpeakLeash](https://speakleash.org/) project. We used over 36 billion tokens for two epochs of training.
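
To put the throughput and token counts in perspective, here is a rough back-of-the-envelope estimate based only on the figures quoted above (pure token throughput; checkpointing, restarts and other overhead are not included):

```python
# Rough scale estimate from the numbers above; real wall-clock time is higher
# because checkpointing, evaluation runs, restarts and other overhead are ignored.
gpus = 256
tokens_per_gpu_per_second = 9200
total_tokens = 72e9  # ~36B tokens x 2 epochs, i.e. "over 70 billion tokens"

aggregate = gpus * tokens_per_gpu_per_second      # ~2.36 million tokens/second
hours = total_tokens / aggregate / 3600           # ~8.5 hours of pure throughput
print(f"{aggregate:,.0f} tokens/s, ~{hours:.1f} h at peak rate")
```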
 
@@ -29,10 +33,11 @@ The training dataset was composed of Polish texts collected and made available t
  * **Model type:** causal decoder-only
  * **Adopted from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
  * **License:** Apache 2.0 (commercial use allowed)
+ * **Model ref:** speakleash:debfc8635c781358e8db833a333887a5
 
  ### Quality evaluation
 
- A XGBoost classification model was prepared and created to evaluate the quality of texts in Polish based on 93 features, such as the ratio of out-of-vocabulary words to all words (oovs), the number of nouns, verbs, or the average sentence length etc.. The model's output indicates the category (HIGH, MEDIUM, LOW) along with the % probability, which allows for the implementation of effective filters for analyzing texts only with a high quality index (HIGH > 90%).
+ An XGBoost classification model was built to evaluate the quality of texts written in Polish. It is based on 93 features, such as the ratio of out-of-vocabulary words to all words (OOVs), the number of nouns and verbs, or the average sentence length. The model assigns each document a category (HIGH, MEDIUM or LOW) together with a probability, which makes it possible to build a dedicated selection pipeline: only documents rated HIGH with a probability exceeding 90% were used.
 
  This filtration and appropriate selection of texts enable the provision of a condensed and high-quality database of texts in Polish for training purposes.
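
Below is a minimal sketch of how such a filter can be applied, assuming the 93 features have already been extracted into a vector and using the standard `xgboost` scikit-learn interface; the model file name, class order and feature extraction are hypothetical, as the actual SpeakLeash pipeline and feature set are not published here:

```python
# Minimal sketch: keep only documents classified as HIGH quality with
# probability above 0.9. Feature extraction (the 93 features) happens elsewhere.
import numpy as np
import xgboost as xgb

clf = xgb.XGBClassifier()
clf.load_model("quality_classifier.json")   # hypothetical path to the trained model

LABELS = ["LOW", "MEDIUM", "HIGH"]          # assumed class order

def keep_document(features: np.ndarray, threshold: float = 0.9) -> bool:
    """Return True if the document is rated HIGH with probability > threshold."""
    proba = clf.predict_proba(features.reshape(1, -1))[0]
    best = int(np.argmax(proba))
    return LABELS[best] == "HIGH" and proba[best] > threshold
```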
 
 
@@ -104,11 +109,53 @@ for seq in sequences:
  Generated output:
  > Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami.
 
+ ## Evaluation
+ 
+ Models have been evaluated on the [Open PL LLM Leaderboard](https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard) in a 5-shot setting. The benchmark evaluates models on NLP tasks such as sentiment analysis, categorization and text classification, but it does not test conversational skills. The reported metrics are:
+ - Average - the average score across all tasks, normalized by baseline scores
+ - Reranking - a reranking task, commonly used in RAG
+ - Reader (Generator) - an open-book question-answering task, commonly used in RAG
+ - Perplexity (lower is better) - reported as a bonus; it does not correlate with the other scores and should not be used for model comparison (a rough illustration of how it can be computed follows the table)
+ 
+ Current scores of pretrained and continuously pretrained models according to the Open PL LLM Leaderboard (5-shot):
+ 
+ | Model | Average | RAG Reranking | RAG Reader | Perplexity |
+ |---------------------------------------------------------------------------------------|----------:|--------------:|-----------:|-----------:|
+ | **7B parameters models:** | | | | |
+ | Baseline (majority class) | 0.00 | 53.36 | - | - |
+ | OPI-PG/Qra-7b | 11.13 | 54.40 | 75.25 | 203.36 |
+ | meta-llama/Llama-2-7b-hf | 12.73 | 54.02 | 77.92 | 850.45 |
+ | internlm/internlm2-base-7b | 20.68 | 52.39 | 69.85 | 3110.92 |
+ | [Bielik-7B-v0.1](https://huggingface.co/speakleash/Bielik-7B-v0.1) | 29.38 | **62.13** | **88.39** | 123.31 |
+ | mistralai/Mistral-7B-v0.1 | 30.67 | 60.35 | 85.39 | 857.32 |
+ | internlm/internlm2-7b | 33.03 | 69.39 | 73.63 | 5498.23 |
+ | alpindale/Mistral-7B-v0.2-hf | 33.05 | 60.23 | 85.21 | 932.60 |
+ | speakleash/mistral-apt3-7B/spi-e0_hf | **35.50** | **62.14** | 87.48 | 132.78 |
+ | | | | | |
+ | **Models with different sizes:** | | | | |
+ | sdadas/polish-gpt2-xl (1.7B) | -23.22 | 48.07 | 3.04 | 160.95 |
+ | Azurro/APT3-1B-Base (1B) | -8.23 | 51.49 | 18.94 | 249.90 |
+ | OPI-PG/Qra-1b (1B) | -5.44 | 47.65 | 38.51 | 398.96 |
+ | internlm/internlm2-1_8b (1.8B) | -2.78 | 49.37 | 31.88 | 60296.30 |
+ | OPI-PG/Qra-13b (13B) | 29.03 | 53.28 | 83.03 | 168.66 |
+ | upstage/SOLAR-10.7B-v1.0 (10.7B) | 38.12 | 75.81 | 86.39 | 641.05 |
+ | | | | | |
+ | **Polish instruction fine-tuned models:** | | | | |
+ | szymonrucinski/Curie-7B-v1 | 26.72 | 55.58 | 85.19 | 389.17 |
+ | Voicelab/trurl-2-7b | 18.85 | 60.67 | 77.19 | 1098.88 |
+ | [Bielik-7B-Instruct-v0.1](https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1) | 39.28 | 61.89 | 86.00 | 277.92 |
+ 
+ As you can see, Bielik-7B-v0.1 does not have the best Average score, but it has some clear advantages, e.g. the best score in the RAG Reader task.
+ 
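For reference, here is a minimal sketch of computing perplexity for this model with Hugging Face Transformers; the leaderboard's exact protocol (evaluation corpus, context length, striding) may differ:

```python
# Rough illustration of computing perplexity for a short Polish text;
# the Open PL LLM Leaderboard may use a different corpus and windowing scheme.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "speakleash/Bielik-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

text = "Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Passing labels=input_ids returns the mean cross-entropy over the sequence
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```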
  ## Limitations and Biases
 
  Bielik-7B-v0.1 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.
 
- Bielik-7B-v0.1 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Bielik-7B-v0.1 was trained on various public datasets. While great efforts have been taken to clear the pretraining data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs.
+ Bielik-7B-v0.1 can produce factually incorrect output and should not be relied on to produce factually accurate information. Bielik-7B-v0.1 was trained on various public datasets. While great effort has been taken to clean the training data, it is still possible that the model will generate lewd, false, biased or otherwise offensive outputs.
 
  ## License
 
@@ -130,13 +177,35 @@ Please cite this model using the following format:
 
  ## Responsible for training the model
 
- * Krzysztof Ociepa - team leadership, conceptualizing, data preparation, process optimization, and oversight of training
- * Łukasz Flis - coordinating and supervising the training
- * Krzysztof Wróbel - benchmarks
- * Adrian Gwoździej - data cleaning and quality
+ * [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/)<sup>SpeakLeash</sup> - team leadership, conceptualizing, data preparation, process optimization and oversight of training
+ * [Łukasz Flis](https://www.linkedin.com/in/lukasz-flis-0a39631/)<sup>Cyfronet AGH</sup> - coordinating and supervising the training
+ * [Adrian Gwoździej](https://www.linkedin.com/in/adrgwo/)<sup>SpeakLeash</sup> - data cleaning and quality
+ * [Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/)<sup>SpeakLeash</sup> - benchmarks
+ 
- The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is also invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH.
+ The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model through their commitment to the open-science SpeakLeash project:
+ [Sebastian Kondracki](https://www.linkedin.com/in/sebastian-kondracki/),
+ [Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/),
+ [Grzegorz Urbanowicz](https://www.linkedin.com/in/grzegorz-urbanowicz-05823469/),
+ [Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/),
+ [Paweł Kiszczak](https://www.linkedin.com/in/paveu-kiszczak/),
+ [Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/),
+ [Paweł Cyrta](https://www.linkedin.com/in/cyrta),
+ [Jacek Chwiła](https://www.linkedin.com/in/jacek-chwila/),
+ [Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/),
+ [Karol Jezierski](https://www.linkedin.com/in/karol-jezierski/),
+ [Kamil Nonckiewicz](https://www.linkedin.com/in/kamil-nonckiewicz/),
+ [Izabela Babis](https://www.linkedin.com/in/izabela-babis-2274b8105/),
+ [Nina Babis](https://www.linkedin.com/in/nina-babis-00055a140/),
+ [Waldemar Boszko](https://www.linkedin.com/in/waldemarboszko),
+ [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/),
+ [Piotr Rybak](https://www.linkedin.com/in/piotrrybak/)
+ and many other wonderful researchers and enthusiasts of the AI world.
+ 
+ Members of the ACK Cyfronet AGH team:
+ [Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/).
 
  ## Contact Us
 
- If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/3G9DVM39).
+ If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/3G9DVM39).