chrisociepa committed on
Commit 3bb6e4e
1 Parent(s): 83c48b6

Update README.md

Files changed (1):
  1. README.md +139 -0
README.md CHANGED
---
license: apache-2.0
language:
- pl
library_name: transformers
tags:
- continuously_pretrained
inference:
  parameters:
    temperature: 0.7
---

# Bielik-7B-v0.1

Bielik-7B-v0.1 is a generative text model with 7 billion parameters, continuously pretrained from its predecessor, [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), on over 70 billion tokens. The model is the result of a unique collaboration between the open-science project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. It was developed and trained on Polish text corpora meticulously collected and processed by the SpeakLeash team, using Poland's large-scale computing infrastructure within the PLGrid environment, specifically the HPC center ACK Cyfronet AGH. Training was supported by computational grant number PLG/2024/016951 and conducted on the Helios supercomputer, which provided the cutting-edge technology and computational resources essential for large-scale machine learning. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision.

## Model

Bielik-7B-v0.1 was trained with [ALLaMo](https://github.com/chrisociepa/allamo), an original open-source framework implemented by [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/). The framework enables fast and efficient training of language models with architectures similar to LLaMA and Mistral.

The model training was conducted on the Helios Supercomputer at ACK Cyfronet AGH, utilizing 256 GH200 cards and achieving a throughput exceeding 9200 tokens/GPU/second.

The training dataset was composed of Polish texts collected and made available through the [SpeakLeash](https://speakleash.org/) project. We used over 36 billion tokens for two epochs of training.

### Model description:

* **Developed by:** [SpeakLeash](https://speakleash.org/)
* **Language:** Polish
* **Model type:** causal decoder-only
* **Adapted from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
* **License:** Apache 2.0 (commercial use allowed)

### Quality evaluation

An XGBoost classification model was built to evaluate the quality of Polish texts based on 93 features, such as the ratio of out-of-vocabulary words to all words (OOVs), the number of nouns and verbs, and the average sentence length. The classifier outputs a category (HIGH, MEDIUM, LOW) together with a probability, which allows effective filters to be applied so that only texts with a high quality index (HIGH > 90%) are used.

This filtration and careful selection of texts provide a condensed, high-quality corpus of Polish texts for training purposes.
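
As an illustration of how such a classifier can be used for filtering, the sketch below assumes a trained `xgboost.XGBClassifier` saved as `quality_classifier.json` and a hypothetical `extract_features()` helper that computes the 93 features for a document; neither artifact is published in this repository, and the actual SpeakLeash pipeline may differ.

```python
import numpy as np
import xgboost as xgb

# Hypothetical file name and label order; the real SpeakLeash classifier is not released here.
CLASSIFIER_PATH = "quality_classifier.json"
LABELS = ["LOW", "MEDIUM", "HIGH"]

clf = xgb.XGBClassifier()
clf.load_model(CLASSIFIER_PATH)

def keep_document(text: str, extract_features, threshold: float = 0.90) -> bool:
    """Keep a document only if it is classified HIGH with probability above the threshold."""
    # extract_features is a placeholder returning the 93 numeric features for the text.
    features = np.asarray(extract_features(text), dtype=np.float32).reshape(1, -1)
    proba = clf.predict_proba(features)[0]
    best = int(np.argmax(proba))
    return LABELS[best] == "HIGH" and proba[best] > threshold
```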

## Training

* Framework: [ALLaMo](https://github.com/chrisociepa/allamo)
* Visualizations: [W&B](https://wandb.ai)

<p align="center">
  <img src="https://huggingface.co/speakleash/Bielik-7B-v0.1/raw/main/train_loss.png">
</p>
<p align="center">
  <img src="https://huggingface.co/speakleash/Bielik-7B-v0.1/raw/main/train_ppl.png">
</p>
<p align="center">
  <img src="https://huggingface.co/speakleash/Bielik-7B-v0.1/raw/main/train_acc.png">
</p>
54
+ ### Training hyperparameters:
55
+
56
+ | **Hyperparameter** | **Value** |
57
+ |-----------------------------|------------------|
58
+ | Micro Batch Size | 4 |
59
+ | Batch Size | 4194304 |
60
+ | Learning Rate (cosine) | 3e-05 -> 2e-05 |
61
+ | Warmup Iterations | 2000 |
62
+ | All Iterations | 17350 |
63
+ | Optimizer | AdamW |
64
+ | β1, β2 | 0.9, 0.95 |
65
+ | Adam_eps | 1e−8 |
66
+ | Weight Decay | 0.1 |
67
+ | Grad Clip | 1.0 |
68
+ | Precision | bfloat16 (mixed) |
69
+
70
+
71
+ ### Quickstart
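For readers who want to reproduce a similar setup outside of ALLaMo, the sketch below shows one way to express the optimizer and learning-rate schedule from the table in plain PyTorch. It is a minimal illustration, not the actual ALLaMo training code; the linear-warmup-then-cosine shape and the exact decay endpoints are assumptions read off the table.

```python
import math
import torch

def lr_at(step: int, warmup: int = 2000, total: int = 17350,
          lr_max: float = 3e-5, lr_min: float = 2e-5) -> float:
    """Linear warmup to lr_max, then cosine decay towards lr_min (assumed schedule shape)."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 10)  # stand-in for the actual transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

for step in range(17350):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    # ... forward pass, loss computation, loss.backward() ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Grad Clip 1.0
    optimizer.step()
    optimizer.zero_grad()
```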
72
+
73
+ This model can be easily loaded using the AutoModelForCausalLM functionality.
74
+
75
+ ```python
76
+ from transformers import AutoTokenizer, AutoModelForCausalLM
77
+
78
+ model_name = "speakleash/Bielik-7B-v0.1"
79
+
80
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
81
+ model = AutoModelForCausalLM.from_pretrained(model_name)
82
+ ```
83
+
84
+ In order to reduce the memory usage, you can use smaller precision (`bfloat16`).
85
+
86
+ ```python
87
+ import torch
88
+
89
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
90
+ ```
91
+
92
+ And then you can use Hugging Face Pipelines to generate text:
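
If a CUDA GPU is available (an assumption about your environment, not a requirement stated above), you can also move the model onto it before building the pipeline, so that generation runs on the GPU:

```python
# Assumes a CUDA device with roughly 14+ GB of memory for the 7B weights in bfloat16.
model = model.to("cuda")
```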
93
+
94
+ ```python
95
+ import transformers
96
+
97
+ text = "Najważniejszym celem człowieka na ziemi jest"
98
+
99
+ pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
100
+ sequences = pipeline(max_new_tokens=100, do_sample=True, top_k=50, eos_token_id=tokenizer.eos_token_id, text_inputs=text)
101
+ for seq in sequences:
102
+ print(f"Result: {seq['generated_text']}")
103
+ ```
104
+ Generated output:
105
+ > Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami.
106
+
107
+ ## Limitations and Biases
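
Equivalently, you can skip the pipeline and call `generate` on the model directly; this is a standard `transformers` pattern, shown here with the same sampling settings as above:

```python
# Tokenize the prompt and place the tensors on the same device as the model.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```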
108
+
109
+ Bielik-7B-v0.1 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.
110
+
111
+ Bielik-7B-v0.1 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Bielik-7B-v0.1 was trained on various public datasets. While great efforts have been taken to clear the pretraining data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs.
112
+

## License

The model is licensed under Apache 2.0, which allows for commercial use.

## Citation
Please cite this model using the following format:

```
@misc{Bielik7Bv01,
    title   = {Introducing Bielik-7B-v0.1: Polish Language Model},
    author  = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Gwoździej, Adrian and {SpeakLeash Team} and {Cyfronet Team}},
    year    = {2024},
    url     = {https://huggingface.co/speakleash/Bielik-7B-v0.1},
    note    = {Accessed: 2024-04-01}, % change this date
    urldate = {2024-04-01} % change this date
}
```

## Responsible for training the model

* Krzysztof Ociepa - team leadership, conceptualizing, data preparation, process optimization, and oversight of training
* Łukasz Flis - coordinating and supervising the training
* Krzysztof Wróbel - benchmarks
* Adrian Gwoździej - data cleaning and quality

The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center ACK Cyfronet AGH.
139
+
140
+ ## Contact Us
141
+
142
+ If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/3G9DVM39).