Update README.md
README.md CHANGED
@@ -9,23 +9,21 @@ language:
 - en
 pipeline_tag: text-generation
 ---
-
 # cerbero-7b Italian LLM 🚀
 
-> 🐢 **
-
+> 🐢 **Cerbero-7b** is the first **100% Free** and Open Source **Italian Large Language Model** (LLM) ready to be used for **research** or **commercial applications**.
 
 <p align="center">
 <img width="300" height="300" src="./README.md.d/cerbero.png">
 </p>
 
-Built on **mistral-7b
+Built on [**mistral-7b**](https://mistral.ai/news/announcing-mistral-7b/), which outperforms Llama2 13B across all benchmarks and surpasses Llama1 34B in numerous metrics.
 
 **cerbero-7b** is specifically crafted to fill the void in Italy's AI landscape.
 
 A **cambrian explosion** of **Italian Language Models** is essential for building advanced AI architectures that can cater to the diverse needs of the population.
 
-**cerbero-7b**, alongside companions like [**Camoscio**](https://github.com/teelinsan/camoscio) and [**Fauno**](https://github.com/RSTLess-research/Fauno-Italian-LLM), aims to kick-start this revolution in Italy, ushering in an era where sophisticated **AI solutions** can seamlessly interact with and understand the intricacies of the **Italian language**, thereby empowering **innovation** across **industries** and fostering a deeper **connection** between **technology** and the **people** it serves.
+**cerbero-7b**, alongside companions like [**Camoscio**](https://github.com/teelinsan/camoscio) and [**Fauno**](https://github.com/RSTLess-research/Fauno-Italian-LLM), aims to help **kick-start** this **revolution** in Italy, ushering in an era where sophisticated **AI solutions** can seamlessly interact with and understand the intricacies of the **Italian language**, thereby empowering **innovation** across **industries** and fostering a deeper **connection** between **technology** and the **people** it serves.
 
 **cerbero-7b** is released under the **permissive** Apache 2.0 **license**, allowing **unrestricted usage**, even **for commercial applications**.
 
@@ -45,11 +43,11 @@ The name "Cerbero," inspired by the three-headed dog that guards the gates of th
 ## Training Details 🚀
 
 cerbero-7b is **fully fine-tuned**, distinguishing itself from LoRA or QLoRA fine-tunes.
-The model is trained on an expansive Italian Large Language Model (LLM) using synthetic datasets generated through dynamic self-chat
+The model is trained on an expansive Italian synthetic dataset generated through dynamic self-chat, using a large context window of **8192 tokens**.
 
 ### Dataset Composition 📊
 
-We employed the [Fauno training dataset](https://github.com/RSTLess-research/Fauno-Italian-LLM). The training data covers a broad spectrum, incorporating:
+We employed a **refined** version of the [Fauno training dataset](https://github.com/RSTLess-research/Fauno-Italian-LLM). The training data covers a broad spectrum, incorporating:
 
 - **Medical Data:** Capturing nuances in medical language. 🩺
 - **Technical Content:** Extracted from Stack Overflow to enhance the model's understanding of technical discourse. 💻
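The training description in the hunk above relies on synthetic conversations produced through dynamic self-chat over an 8192-token context window. A minimal sketch of what such a self-chat loop could look like follows; the generator checkpoint, the `[|Umano|]`/`[|Assistente|]` turn tags, and the sampling settings are illustrative assumptions, not details taken from the README.

```python
# Illustrative self-chat sketch: a generator model writes both sides of an
# Italian dialogue to build synthetic training conversations.
# NOTE: the generator checkpoint, the [|Umano|]/[|Assistente|] turn tags, and the
# sampling settings are assumptions; the actual pipeline is not described in this diff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

generator_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed generator model
tokenizer = AutoTokenizer.from_pretrained(generator_id)
model = AutoModelForCausalLM.from_pretrained(
    generator_id, torch_dtype=torch.float16, device_map="auto"
)

def next_turn(dialogue: str, speaker: str, max_new_tokens: int = 256) -> str:
    """Continue the dialogue as the given speaker and return only the new text."""
    prompt = f"{dialogue}\n[|{speaker}|]"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
        )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()

dialogue = "Questa è una conversazione tra un umano ed un assistente AI."
for _ in range(3):  # alternate roles to grow a long synthetic conversation
    for speaker in ("Umano", "Assistente"):
        dialogue += f"\n[|{speaker}|] {next_turn(dialogue, speaker)}"
print(dialogue)
```

Alternating the two roles with a single generator is what lets long dialogues accumulate and exercise a large context window such as the 8192 tokens mentioned above.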
@@ -83,7 +81,7 @@ prompt = """Questa è una conversazione tra un umano ed un assistente AI.
 
 input_ids = tokenizer(prompt, return_tensors='pt').input_ids
 with torch.no_grad():
-    output_ids = model.generate(input_ids, max_new_tokens=
+    output_ids = model.generate(input_ids, max_new_tokens=128)
 
 generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
 print(generated_text)
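The hunk above shows only a fragment of the README's inference snippet; the prompt definition and model loading sit outside the diff. A minimal end-to-end sketch is given below; the Hugging Face repository id `galatolo/cerbero-7b`, the dtype/device settings, and the prompt turns after the opening line are assumptions rather than details confirmed by this diff.

```python
# Minimal inference sketch. Only the lines that appear in the diff hunk above are
# taken from the README; the repo id and the prompt's human/assistant turns are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "galatolo/cerbero-7b"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The opening line comes from the hunk header; the turns below are illustrative placeholders.
prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso usare cerbero-7b nel mio progetto?
[|Assistente|]"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```

Wrapping generation in `torch.no_grad()` and decoding with `skip_special_tokens=True` mirrors the lines shown in the hunk above.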
|